edge crashes on Windows, memory access violation (LZO decompression?) #33

Open
aarojun opened this issue May 17, 2024 · 26 comments
Labels
bug (Something isn't working), help wanted (Extra attention is needed)

Comments

@aarojun

aarojun commented May 17, 2024

Some other users and I are experiencing frequent crashes on n3n-edge 3.3.4 due to an access violation during lzo1x decompression. I'm unsure how to reproduce this for further debugging.
If the decompression method is unsafe, I suppose one user sending corrupt packets could make a community momentarily unusable.
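
For readers following along, here is a minimal sketch of the difference between an "unsafe" decoder and a bounds-checked one. It is a toy run-length decoder, not LZO, and `rle_decode_bounded` is a hypothetical name used only for illustration. minilzo itself ships a bounds-checked entry point, `lzo1x_decompress_safe`, next to `lzo1x_decompress`; whether switching to it would fix the crashes reported here is not established in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy run-length format (NOT LZO): a sequence of (count, value) byte pairs.
 * Returns 0 on success and stores the number of bytes written in *out_len.
 * Returns -1 if the input is truncated, -2 if the output would overflow. */
int rle_decode_bounded(const uint8_t *in, size_t in_len,
                       uint8_t *out, size_t out_cap, size_t *out_len)
{
    size_t ip = 0, op = 0;

    while (ip < in_len) {
        if (in_len - ip < 2)
            return -1;              /* dangling count byte: corrupt input */

        size_t count = in[ip];
        uint8_t value = in[ip + 1];
        ip += 2;

        /* op never exceeds out_cap, so this subtraction cannot wrap */
        if (count > out_cap - op)
            return -2;              /* refuse instead of writing past out */

        for (size_t i = 0; i < count; i++)
            out[op++] = value;
    }

    *out_len = op;
    return 0;
}
```

An unchecked decoder would omit the `-2` branch and trust the input to describe an output that fits, which is exactly what a corrupt or hostile packet can violate.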

Issue ntop/n2n#1165 could be related.
We may test whether using an older version of n2n (pre 3.1.1) for edges on communities changes this behavior.

WinDbg !analyze -v result:

Details

*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************


KEY_VALUES_STRING: 1

    Key  : AV.Fault
    Value: Read

    Key  : Analysis.Elapsed.mSec
    Value: 479

    Key  : Analysis.IO.Other.Mb
    Value: 0

    Key  : Analysis.IO.Read.Mb
    Value: 0

    Key  : Analysis.IO.Write.Mb
    Value: 0

    Key  : Analysis.Init.Elapsed.mSec
    Value: 7206

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 81

    Key  : Failure.Bucket
    Value: INVALID_POINTER_READ_c0000005_n3n-edge.exe!Unknown

    Key  : Failure.Hash
    Value: {c9829084-8ed0-c0cc-a3a7-5d8477630a4e}

    Key  : Timeline.OS.Boot.DeltaSec
    Value: 25183

    Key  : Timeline.Process.Start.DeltaSec
    Value: 520

    Key  : WER.OS.Branch
    Value: ni_release

    Key  : WER.OS.Version
    Value: 10.0.22621.1

    Key  : WER.Process.Version
    Value: 3.0.0.10


FILE_IN_CAB:  n3n-edge.exe.4000.dmp

NTGLOBALFLAG:  0

APPLICATION_VERIFIER_FLAGS:  0

CONTEXT:  (.ecxr)
rax=000000820b5fd6f8 rbx=0000000000000069 rcx=0000000000000003
rdx=000000820b5f2828 rsi=000000820b5fcefd rdi=000000820b5fd700
rip=00007ff64e824c48 rsp=000000820b5fc728 rbp=000000820b5fc8fc
 r8=000000820b5fd6d0  r9=000000820b5fc798 r10=000000820b5fced0
r11=000000820b5f2828 r12=0000000000000000 r13=0000000000000008
r14=0000000066477365 r15=000000820b5fc8f0
iopl=0         nv up ei pl zr ac po nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010256
n3n_edge+0x34c48:
00007ff6`4e824c48 488b2a          mov     rbp,qword ptr [rdx] ds:00000082`0b5f2828=????????????????
Resetting default scope

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00007ff64e824c48 (n3n_edge+0x0000000000034c48)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: 000000820b5f2828
Attempt to read from address 000000820b5f2828

PROCESS_NAME:  n3n-edge.exe

READ_ADDRESS:  000000820b5f2828 

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR:  c0000005

EXCEPTION_PARAMETER1:  0000000000000000

EXCEPTION_PARAMETER2:  000000820b5f2828

IP_ON_STACK: 
+0
00000082`0b5fe53e 3b903f90b335    cmp     edx,dword ptr [rax+35B3903Fh]

FRAME_ONE_INVALID: 1

STACK_TEXT:  
00000082`0b5fc728 00000082`0b5fe53e     : 00000000`00000069 00000082`0b5fd6d0 00000082`0b5fc8fc 00000243`38c67ab0 : n3n_edge+0x34c48
00000082`0b5fc730 00000000`00000069     : 00000082`0b5fd6d0 00000082`0b5fc8fc 00000243`38c67ab0 00000082`0b5fd6d0 : 0x00000082`0b5fe53e
00000082`0b5fc738 00000082`0b5fd6d0     : 00000082`0b5fc8fc 00000243`38c67ab0 00000082`0b5fd6d0 00007ff6`4e81a104 : 0x69
00000082`0b5fc740 00000082`0b5fc8fc     : 00000243`38c67ab0 00000082`0b5fd6d0 00007ff6`4e81a104 00007ff6`4e841f3f : 0x00000082`0b5fd6d0
00000082`0b5fc748 00000243`38c67ab0     : 00000082`0b5fd6d0 00007ff6`4e81a104 00007ff6`4e841f3f 00000082`0b5fe53e : 0x00000082`0b5fc8fc
00000082`0b5fc750 00000082`0b5fd6d0     : 00007ff6`4e81a104 00007ff6`4e841f3f 00000082`0b5fe53e 00000000`00000079 : 0x00000243`38c67ab0
00000082`0b5fc758 00007ff6`4e81a104     : 00007ff6`4e841f3f 00000082`0b5fe53e 00000000`00000079 00000000`00000002 : 0x00000082`0b5fd6d0
00000082`0b5fc760 00007ff6`4e841f3f     : 00000082`0b5fe53e 00000000`00000079 00000000`00000002 00000000`00000000 : n3n_edge+0x2a104
00000082`0b5fc768 00000082`0b5fe53e     : 00000000`00000079 00000000`00000002 00000000`00000000 00000243`38c67ab0 : n3n_edge+0x51f3f
00000082`0b5fc770 00000000`00000079     : 00000000`00000002 00000000`00000000 00000243`38c67ab0 00000082`0b5fced0 : 0x00000082`0b5fe53e
00000082`0b5fc778 00000000`00000002     : 00000000`00000000 00000243`38c67ab0 00000082`0b5fced0 00000000`00000000 : 0x79
00000082`0b5fc780 00000000`00000000     : 00000243`38c67ab0 00000082`0b5fced0 00000000`00000000 00000082`0b5fc8f0 : 0x2


STACK_COMMAND:  ~0s; .ecxr ; kb

SYMBOL_NAME:  n3n_edge+34c48

MODULE_NAME: n3n_edge

IMAGE_NAME:  n3n-edge.exe

FAILURE_BUCKET_ID:  INVALID_POINTER_READ_c0000005_n3n-edge.exe!Unknown

OS_VERSION:  10.0.22621.1

BUILDLAB_STR:  ni_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {c9829084-8ed0-c0cc-a3a7-5d8477630a4e}

Followup:     MachineOwner
---------

@hamishcoleman added the bug and help wanted labels May 17, 2024
@hamishcoleman
Contributor

What leads you to believe this is a problem in the LZO? Do you have verbose logs from a crash?

@aarojun
Author

aarojun commented May 17, 2024

Sorry, the above was a poor example. I had a string of crashes where the failure bucket referenced lzo1x decompress. Sometimes this happened a moment after a peer joined a community or formed a connection.
I'll try to reproduce with verbose logs. The core problem may still be something else.

@aarojun changed the title from "edge crashes on Windows, lzo decompression memory access violation" to "edge crashes on Windows, memory access violation (lzo decompression?)" May 17, 2024
@aarojun changed the title from "edge crashes on Windows, memory access violation (lzo decompression?)" to "edge crashes on Windows, memory access violation (LZO decompression?)" May 17, 2024
@hamishcoleman
Contributor

Thanks! More data will certainly help dig into this.

Are you using the binaries from the release or compiling your own? (I want to check if the reported addresses should match up with the symbols we have)

@NiKola-UE

@aarojun, it's good if you managed to run any N3N application on Windows at all, because I couldn't. I downloaded both and unblocked them, but nothing happens. No antimalware or antivirus alerts came up (I currently use Avira and Malwarebytes Anti-Malware), and even Windows Defender didn't flag anything as suspicious, but still nothing happens, even when I run them as administrator. Why, I really don't know.

@hamishcoleman
Contributor

@NiKola-UE we test n3n on Windows regularly to prove that it is definitely working. You should create a new ticket and describe your situation in that new ticket so we can work on figuring out what is happening for you.

@NiKola-UE

OK, I'll try again. I'm using Windows 11, but I don't know if that has anything to do with it. If there are problems again, I will open a new issue for it, although I have already said everything here. Maybe I missed something after all...

@aarojun
Author

aarojun commented May 19, 2024

@NiKola-UE in case it wasn't clear, n3n is not a drop-in replacement for n2n: edge.exe must be called with "edge.exe start" instead of just double-clicking, config files are located in user\n3n\, and the command syntax has changed from n2n. See the docs if these are causing issues. For me it generally runs without issue on Windows 11, just with different syntax (which is often more readable) compared to n2n. But I recommend starting a new issue.

For the memory access violation crashes, I had the following crash today but don't have more details, as I didn't have the debugger running and couldn't reproduce it in a short timeframe.

The vast majority of these crashes have referenced !lzo1x_decompress which is why I brought it up in the opening comment.

Details

*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************


KEY_VALUES_STRING: 1

    Key  : AV.Fault
    Value: Write

    Key  : Analysis.CPU.mSec
    Value: 359

    Key  : Analysis.Elapsed.mSec
    Value: 527

    Key  : Analysis.IO.Other.Mb
    Value: 0

    Key  : Analysis.IO.Read.Mb
    Value: 3

    Key  : Analysis.IO.Write.Mb
    Value: 0

    Key  : Analysis.Init.CPU.mSec
    Value: 46

    Key  : Analysis.Init.Elapsed.mSec
    Value: 7986

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 79

    Key  : Failure.Bucket
    Value: INVALID_POINTER_WRITE_c0000005_n3n-edge-v3.3.4.exe!lzo1x_decompress

    Key  : Failure.Hash
    Value: {c22f5744-d13e-e417-97fa-1dac132180ce}

    Key  : Timeline.OS.Boot.DeltaSec
    Value: 134897

    Key  : Timeline.Process.Start.DeltaSec
    Value: 194

    Key  : WER.OS.Branch
    Value: ni_release

    Key  : WER.OS.Version
    Value: 10.0.22621.1


FILE_IN_CAB:  n3n-edge-v3.3.4.exe.24020.dmp

NTGLOBALFLAG:  0

APPLICATION_VERIFIER_FLAGS:  0

CONTEXT:  (.ecxr)
rax=000000c834200006 rbx=0000000000000074 rcx=0000000000007450
rdx=000000c8341fe18c rsi=0000000000000000 rdi=000000c83420656e
rip=00007ff682c74d43 rsp=000000c8341fc958 rbp=0000000000000e8a
 r8=000000c8341fd900  r9=000000c8341fc9c8 r10=000000c8341fd100
r11=000000c8341ff07c r12=0000000000007458 r13=00000000000007a0
r14=0000000066491ffa r15=000000c8341fcb20
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010200
n3n_edge_v3_3_4!lzo1x_decompress+0x1e3:
00007ff6`82c74d43 488970f8        mov     qword ptr [rax-8],rsi ds:000000c8`341ffffe=????????????????
Resetting default scope

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00007ff682c74d43 (n3n_edge_v3_3_4!lzo1x_decompress+0x00000000000001e3)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 0000000000000001
   Parameter[1]: 000000c834200000
Attempt to write to address 000000c834200000

PROCESS_NAME:  n3n-edge-v3.3.4.exe

WRITE_ADDRESS:  000000c834200000 

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%p referenced memory at 0x%p. The memory could not be %s.

EXCEPTION_CODE_STR:  c0000005

EXCEPTION_PARAMETER1:  0000000000000001

EXCEPTION_PARAMETER2:  000000c834200000

STACK_TEXT:  
000000c8`341fc958 00007ff6`82c6a104     : 00007ff6`82c91f3f 000000c8`341fe76e 00000000`00000084 00000000`00000002 : n3n_edge_v3_3_4!lzo1x_decompress+0x1e3
000000c8`341fc990 00007ff6`82c49bbe     : 01daa96b`cf808048 0000023a`ea860468 00000000`00000000 00000000`00000000 : n3n_edge_v3_3_4!transop_decode_lzo+0x34
000000c8`341fc9e0 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : n3n_edge_v3_3_4!process_udp+0xa6e


STACK_COMMAND:  ~0s; .ecxr ; kb

FAULTING_SOURCE_LINE:  /home/runner/work/n3n/n3n/src/minilzo.c

FAULTING_SOURCE_FILE:  /home/runner/work/n3n/n3n/src/minilzo.c

FAULTING_SOURCE_LINE_NUMBER:  5468

FAULTING_SOURCE_CODE:  
No source found for '/home/runner/work/n3n/n3n/src/minilzo.c'


SYMBOL_NAME:  n3n_edge_v3_3_4!lzo1x_decompress+1e3

MODULE_NAME: n3n_edge_v3_3_4

IMAGE_NAME:  n3n-edge-v3.3.4.exe

FAILURE_BUCKET_ID:  INVALID_POINTER_WRITE_c0000005_n3n-edge-v3.3.4.exe!lzo1x_decompress

OS_VERSION:  10.0.22621.1

BUILDLAB_STR:  ni_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {c22f5744-d13e-e417-97fa-1dac132180ce}

Followup:     MachineOwner
---------

I believe the build here is the Windows x64 binary from 51eb3d7 https://github.com/hamishcoleman/n3n/actions/runs/9088706213/artifacts/1503386924

Not sure why I had that on the edge; I'll default to the current release binary for future testing. But this may go on hold for me depending on how much use n3n is seeing.
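
A quick arithmetic check on the second dump, as a sketch rather than a diagnosis. The faulting instruction is an 8-byte store to `[rax-8]` with `rax=000000c834200006`, so the store starts at `000000c8341ffffe`, while the reported write address is `000000c834200000`. The two agree if the store straddles a page boundary and only the upper page is unwritable. The helper names below are hypothetical, and the 4096-byte page size is an assumption (it is the usual value on x64 Windows).

```c
#include <stdint.h>

/* Returns 1 if a write of `width` bytes starting at `addr` touches two
 * different pages of size `page_size`, 0 otherwise. */
int write_straddles_page(uint64_t addr, uint64_t width, uint64_t page_size)
{
    uint64_t last = addr + width - 1;
    return (addr / page_size) != (last / page_size);
}

/* Base address of the page that contains the last byte of the write. */
uint64_t last_page_base(uint64_t addr, uint64_t width, uint64_t page_size)
{
    uint64_t last = addr + width - 1;
    return (last / page_size) * page_size;
}
```

With `addr = 0x000000c834200006 - 8`, `width = 8` and `page_size = 4096`, the first function reports a straddle and the second returns `0x000000c834200000`, matching `WRITE_ADDRESS`. That address is also `0x36a8` (13992) bytes above `rsp=000000c8341fc958`, which would be consistent with a write running upward through the stack until it leaves the mapped region, although the dump alone does not prove that.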

@hamishcoleman
Contributor

I suspect that the memory access issue is not in the lzo library, but in the pointers that have been handed to the lzo.

  • Are you able to collect the verbose output from the edge at the time of a crash?
  • Is there anything in particular that you are doing at the time?
  • Can you paste your (sanitised) config file?
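
One way to gather evidence for or against that hypothesis without a debugger attached is a guard band after the output buffer: fill a few bytes past the declared capacity with a known pattern, run the decoder, and check whether the pattern survived. The sketch below is self-contained and deliberately generic. The `decode_fn` type, `decode_with_guard`, and both sample decoders are hypothetical and do not match the real minilzo signature or any n3n function.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_LEN  16
#define GUARD_BYTE 0xA5

/* Hypothetical decoder signature; it is expected to respect out_cap. */
typedef int (*decode_fn)(const uint8_t *in, size_t in_len,
                         uint8_t *out, size_t out_cap, size_t *out_len);

/* Runs `decode` with `payload_cap` usable bytes of `buf`.
 * `buf` must hold at least payload_cap + GUARD_LEN bytes.
 * Returns the decoder's result, or -100 if the guard bytes were modified. */
int decode_with_guard(decode_fn decode, const uint8_t *in, size_t in_len,
                      uint8_t *buf, size_t payload_cap, size_t *out_len)
{
    memset(buf + payload_cap, GUARD_BYTE, GUARD_LEN);

    int rc = decode(in, in_len, buf, payload_cap, out_len);

    for (size_t i = 0; i < GUARD_LEN; i++)
        if (buf[payload_cap + i] != GUARD_BYTE)
            return -100;            /* decoder wrote past payload_cap */
    return rc;
}

/* Well-behaved sample decoder: plain copy that honours out_cap. */
int copy_bounded(const uint8_t *in, size_t in_len,
                 uint8_t *out, size_t out_cap, size_t *out_len)
{
    if (in_len > out_cap)
        return -1;
    memcpy(out, in, in_len);
    *out_len = in_len;
    return 0;
}

/* Deliberately buggy sample decoder: ignores out_cap. */
int copy_overrun(const uint8_t *in, size_t in_len,
                 uint8_t *out, size_t out_cap, size_t *out_len)
{
    (void)out_cap;
    memcpy(out, in, in_len);
    *out_len = in_len;
    return 0;
}
```

If the guard is intact when a crash or corruption is observed, the decoder stayed inside the capacity it was given and the caller's pointers or lengths become the more likely suspect. A guard band only catches overruns that land inside it, so a clean result is weak evidence, not proof.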

@eebssk1

eebssk1 commented May 21, 2024

Related: ntop/n2n#1165

@hamishcoleman
Contributor

Hi @eebssk1 , do you have any logs or dumps or any way to reproduce this issue that you can share?

@eebssk1

eebssk1 commented May 21, 2024

> Hi @eebssk1 , do you have any logs or dumps or any way to reproduce this issue that you can share?

I currently do not use n2n much. I have set the MTU to 1280, and I'm not getting the problem for now. It seems to happen more often if the network is unstable or a game is sending bursts of small packets. I'll share more information the next time this happens.

[Anyway, I added an extra check for corrupted data so that it logs the incident and silently continues, preventing the crash. As I said, I haven't seen the incident log yet.]

@hamishcoleman
Contributor

If your extra check is able to prevent the crash, are you able to share what you added?

@eebssk1

eebssk1 commented May 21, 2024

> If your extra check is able to prevent the crash, are you able to share what you added?

Nah, it's just a quick and dirty hack to prevent the crash by checking whether the struct data are garbage: ntop/n2n#1165 (comment)

According to that stack trace and memory dump, the function crashes at the end, so I added a check at the beginning to skip the entire function if the data are corrupt.
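
For illustration only, the general shape of such a "check first, log, and skip" guard might look like the sketch below. This is not the actual patch linked above. `struct pkt_view`, `MAX_PKT_LEN`, `pkt_view_is_sane`, and `handle_packet` are hypothetical names, and the plausibility limits are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PKT_LEN 2048   /* assumed upper bound for one packet */

/* Hypothetical stand-in for the per-packet state; not an n2n/n3n type. */
struct pkt_view {
    const uint8_t *data;
    size_t len;
};

static unsigned long dropped_packets;

/* Returns 1 if the fields look plausible, 0 otherwise. */
static int pkt_view_is_sane(const struct pkt_view *p)
{
    return p != NULL && p->data != NULL &&
           p->len > 0 && p->len <= MAX_PKT_LEN;
}

/* Returns 0 if the packet was processed, -1 if it was skipped. */
int handle_packet(const struct pkt_view *p)
{
    if (!pkt_view_is_sane(p)) {
        dropped_packets++;
        fprintf(stderr,
                "dropping packet with implausible fields (dropped so far: %lu)\n",
                dropped_packets);
        return -1;              /* skip instead of crashing later */
    }

    /* ... normal processing would go here ... */
    return 0;
}
```

A guard like this can only reject values that look wrong, so it may hide the symptom while the underlying corruption stays unexplained. The log line is the useful part for debugging, since it records when and how often the bad state appears.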

@hamishcoleman
Contributor

That is unfortunate, since I cannot see any specific buffer overflow in the code, nor am I able to reproduce any similar crash with heavy testing on Linux.

So, without more data or a reproducible case, I doubt we can make any progress here.

@eebssk1

eebssk1 commented May 21, 2024

> That is unfortunate, since I cannot see any specific buffer overflow in the code, nor am I able to reproduce any similar crash with heavy testing on Linux.
>
> So, without more data or a reproducible case, I doubt we can make any progress here.

I never get any problems on Linux; it seems to happen only on Windows. Anyway, when it happens I'll try to collect as much data as possible.

@hamishcoleman
Contributor

I will be trying to set up a better Windows test environment, but it is a lot more difficult than testing on Linux.

@NiKola-UE

I would just like to complete what I already mentioned here (it is not necessary for a new issue): In my case, it is more and more certain that antiviruses interfere, block and make it impossible to start N3N and similar apps and tools, not allowing them to work or even to install them at all, falsely recognizing them as something potentially dangerous, suspicious, unwanted, fraudulent, harmful, etc. The instructions, tutorials and guides should also indicate this, as well as explanations of how it should be properly set up and configured for us without advanced technical knowledge who know nothing or very little about programming; which is complex and demanding, but it will still be helpful and useful. Thank you.

@eebssk1

eebssk1 commented Jun 5, 2024

In my case, I have never had any problem with zstd compression.
If I switch to lzo, it crashes in ~30 seconds even without active connections.
I'm currently still stuck with n2n (though I did additional work to make it work with mingw and OpenSSL 3.2).

@hamishcoleman
Contributor

@NiKola-UE It is very hard to include any clear statements about antivirus products as - pretty much by design - they are unclear on exactly how to bypass them. I would hope that there are clear log entries from any antivirus software making it obvious that they have blocked something.

@hamishcoleman
Contributor

@eebssk1 Can you outline what is keeping you on n2n? Perhaps there is something we can merge into n3n to make a migration easier?

@eebssk1

eebssk1 commented Jun 5, 2024

> @eebssk1 Can you outline what is keeping you on n2n? Perhaps there is something we can merge into n3n to make a migration easier?

The Windows GUI maker @happyntec does not have any interest in supporting n3n.
So I made my own GUI, but it is still designed around the original n2n to ensure compatibility with many of my friends.
I may add n3n support to it later, but that's unlikely to happen in the near future since it's currently stable enough for us to use.

However, if you could make n3n (in a new branch) a drop-in replacement that comes with only bug fixes and no API/CLI breakage, that would be really appreciated.

@eebssk1

eebssk1 commented Jun 5, 2024

> I would just like to complete what I already mentioned here (it is not necessary for a new issue): In my case, it is more and more certain that antiviruses interfere, block and make it impossible to start N3N and similar apps and tools, not allowing them to work or even to install them at all, falsely recognizing them as something potentially dangerous, suspicious, unwanted, fraudulent, harmful, etc. The instructions, tutorials and guides should also indicate this, as well as explanations of how it should be properly set up and configured for us without advanced technical knowledge who know nothing or very little about programming; which is complex and demanding, but it will still be helpful and useful. Thank you.

Do you have any explicit source that indicates AV/FW blocks n2n?
Mine is compiled by myself, UPX compressed, and then self-signed (as is my GUI).
Neither my friends' AV nor mine blocks it currently.

@hamishcoleman
Contributor

n3n is deliberately cleaning up the API and CLI, so there will not be the compatibility you are looking for.

I'd say that n2n is the unstable one - the number of stability bugs that have been found and fixed in n3n is only increasing.

It is a pity that none of these GUI systems were contributed to the repo, otherwise it would have been possible to forward port them along with the other n3n changes.

@NiKola-UE

To conclude: I used Avast earlier, whose adware and false-positive alarms are really annoying, and now I use Avira, which, at least in the latest version, automatically blocks everything that looks like that, which I will have to deal with in detail myself. Maybe antiviruses and similar programs can sometimes cause these and similar crashes, which is what this issue primarily deals with, but I don't know...

@hamishcoleman
Contributor

@NiKola-UE It does sound like we should have a different ticket to track reports of issues with antivirus software - perhaps you could create one and add the logs and event messages generated by your antivirus software so that they can be examined and checked if there are steps that can be taken to help avoid triggering them

@NiKola-UE

NiKola-UE commented Jun 9, 2024

You're probably right. I will do so and record everything nicely when I have more time, but I think it's best to open a separate issue that will deal with it.
