RFC_INVALID_HANDLE #128
I am aware of this long-standing issue, which is hardly reproducible in a controlled manner. I finally found a simple way to reproduce it (unit tests in locktest.js), but I can't be sure the exact same cause is at work in your environment. Reading the code, an empty connectionInfo should never be possible; I need to investigate that further. Could you build the current master from source and test for some time whether there are any improvements in your environment? I can also build a Windows binary or publish a (pre)release with this fix, if needed.
I've built from source. From looking at the commit, the casts should already be implicit. I don't understand either how the connectionInfo gets emptied, since it is read directly from the SAP NWRFC library and it looks like there should at least be empty fields. Btw, the connectionInfo structure also looks a bit different.
Besides the casts, additional exclusive locks were added in Ping() and PrepareAsync() (see Lock/UnlockMutex()). With the just-pushed commit c5a443b, another exclusive lock is added in ConnectionInfo(), to prevent another client from interfering after the connection handle check and invalidating the connection handle before connectionInfo is read. But even in that case, an error should be returned, not an empty connectionInfo. It is hard to believe, but the SAP NWRFC SDK might also have a bug; I will check with the SDK team :)
)-: just had another one, this time with node-rfc e69b5b9. Same basic effect: a connection with empty connectionInfo straight out of a connection pool acquire.
Not sure whether this will help, but since we lack information: immediately before the bogus invoke we log client.id, client.isAlive and client.connectionInfo, as in the original report. The connection pool options are likewise unchanged from the original report ({ min: 1 }).
After discussing with NWRFC SDK experts, two changes are implemented with f058683. The first adjusts how the connection's alive status is tracked; more restrictive logic is added to reset this flag after certain error types raised on the ABAP server side. Could you please build from master with this commit included and re-test? The NWRFC SDK trace would also be helpful for analysis; it can be activated by setting a connection parameter.
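For illustration, here is a hedged sketch of what activating the SDK trace from node-rfc might look like; the trace parameter name (here trace, with level "3") and all connection values are assumptions/placeholders, since the exact parameter name did not survive in the comment above:

```js
// Hedged sketch: passing an SDK trace level via the connection parameters.
// The "trace" key is an assumption; host/system/user values are placeholders.
const { Pool } = require("node-rfc");

const sapConnectionParameters = {
  ashost: "sap.example.com", // placeholder
  sysnr: "00",               // placeholder
  client: "100",             // placeholder
  user: "DEMO_USER",         // placeholder
  passwd: "secret",          // placeholder
  trace: "3",                // assumed: full NWRFC SDK trace level
};

const connectionPool = new Pool(sapConnectionParameters, { min: 1, max: 10 });
```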
Thanks - I've pulled, installed, prebuilt, and I'm going to test next week. Not sure about the trace yet, as it may run for a day between errors (for such cases, it would be useful if the SDK had a "rotating trace" option that only keeps the last hour of trace output - that could be copied once an error occurred; for call-level tracing, we had to add our own trace variant anyway to get a useful amount of trace data). Btw, the previous fixes did not subjectively change the number of RFC_INVALID_HANDLE errors.
Sadly, the errors are not gone - this time it was an RFC_UNKNOWN_ERROR. So the symptoms have changed, and the trace happened to be on at the time.
Checked together with NWRFC SDK experts; an RFC_UNKNOWN_ERROR, coming from levels below the SDK, makes it hard to guess what could be wrong here. I did, however, find two critical gaps, leading to issues like #132 and #133; these are now fixed in master. Client locking is also checked in detail and more unit tests were added for the Pool. All these fixes will be shipped with the next release, as soon as #122 is fixed (waiting for prebuild/prebuild#267). If waiting for the release is not practical, you can build from master in the meantime. The newly added features might be helpful in troubleshooting and custom logging.
Hope this helps :)
The changes are published in v1.1.0.
Thanks, it is good to see you're active on these issues (and in touch with the SDK folks) and that there is visible progress, though the "closing while active" issue should not affect us. I did not have as much opportunity to test last week as before and did not keep the trace on all the time, but - subjectively - the number of incidents seems to have gone down (from at least once per day to less than once per day, over a week). I will update and check whether the additional info helps with logging. And having the existing fixes published prompts me to update our other servers (I kept them running on 1.0.3 for comparison, and they're still having more frequent issues, including random shutdowns).

The thing that confused me about the last logs I provided was that one call failed and another, interleaved, call got the INVALID_HANDLE - was that a coincidence of two calls failing one after the other with different errors, or was there some interaction between one call failing and the other one's error message?

Btw, when browsing the trace files, I see that the "Sending data" entries vary between code pages 4103 and 1100 while the "Receiving data" entries vary between 4103 and unknown - is that normal? (I would expect sending/receiving to stick to one known code page each.)
I'm currently a bit confused about the extra outputs, in particular the client id. First, I noticed that some freshly acquired clients look odd in this respect; I captured a slice of the log for one specific client.
Here is an INVALID_HANDLE with the latest versions (again, the trace wasn't on, but some of the new logging options were). Given that previous traces showed code pages 4103 and 1100, is it odd that there is only 4103 here? As the connectionHandle sees some reuse, here is a log slice for that handle, ending directly in an INVALID_HANDLE.
Thanks a lot for the very valuable input.
Yes, that’s normal. No need to worry - the code pages are handled automatically between the NWRFC SDK and the RFC server system.
The client id comes from the N-API/C wrapper and is incremented with each new Client instance. After opening the connection, the client gets the connection handle. The handle can change after re-opening the connection; the id stays the same. The reopen happens, for example, after the client is released back to the pool and acquired for a new task.
This could actually be the cause of the RFC_INVALID_HANDLE. After certain errors, the connection is automatically closed and cleaned up by the NWRFC SDK. So far node-rfc was transparent in this regard, only updating the client's alive status. With the latest commit 5794944, node-rfc automatically re-opens the connection after critical errors. This should eliminate, or at least considerably reduce, RFC_INVALID_HANDLE errors. If the problem persists, I would try to reproduce it by using the same stack (any other components or frameworks on top of node-rfc?) and calling the same RFC function.
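To make the behaviour described above (stable client id, possibly changing connection handle) visible in a log, here is a small hedged sketch; Pool.acquire/release, client.id, client.invoke and client.connectionInfo are the APIs used elsewhere in this thread, while the loop, the test function STFC_CONNECTION and its parameter are placeholders:

```js
// Hedged sketch: log client.id and connectionInfo across acquire/release
// cycles to see whether the same id comes back with a different connection.
async function observePool(connectionPool, rounds = 5) {
  for (let i = 0; i < rounds; i++) {
    const client = await connectionPool.acquire();
    console.log("round", i, "client.id =", client.id,
                "connectionInfo =", JSON.stringify(client.connectionInfo));
    try {
      await new Promise((resolve, reject) => {
        // STFC_CONNECTION is the standard ABAP connection-test function,
        // used here only as a placeholder call.
        client.invoke("STFC_CONNECTION", { REQUTEXT: "ping" }, (err, res) =>
          err ? reject(err) : resolve(res));
      });
    } finally {
      connectionPool.release(client);
    }
  }
}
```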
After a full day of testing 1.1.0 on two servers: that version definitely generates a lot more RFC_INVALID_HANDLEs, but almost all of them come immediately after a preceding failure on the same connection. I do find that odd.
Still getting a lot of RFC_INVALID_HANDLEs with e7c1fba, mostly after a preceding failure on the same connection.
Could there be an option for not reusing connections (i.e., not putting the released connection back into the pool)?
The reason for the empty connectionInfo is still open and being discussed with NWRFC SDK experts, but it is hard to guess without a reproducible example.
You may also test the Pool without reuse of released connections: 9833196. In parallel, I will check whether the Pool functionality could be moved to C, making it less dependent on the Promises implementation.
I observed some unit tests behaving differently when run from the command line and via npm script, even though exactly the same command line is executed. Sometimes, via npm script, the tests were failing because client.id could not be referenced, the client being undefined. Isolated tests and all tests run from the command line were always passing; they failed only when running all tests via npm script. Suspecting it could somehow be related to the Promises used in the Pool's internal active/ready collections, I replaced them with Client instances, reducing the overhead. The npm testing issues are now gone, and perhaps it could improve the situation in your environment as well: a6ec7ef. If not, the Pool shall be moved to the C level, becoming independent of Promise implementations on different platforms.
I'm still observing RFC_INVALID_HANDLE errors. As before, the INVALID_HANDLE appears when the freshly acquired client is used.
I've collected 2 somewhat longer log slices, hoping to see some connections. As before, the individual calls proceed as follows: an HTTP post comes in, a client is acquired from the pool, the RFC function is invoked, and the client is released in the invoke callback.

The most common failure pattern (P1) is still the one from the original report: an empty connectionInfo straight out of pool acquire, followed by an RFC_INVALID_HANDLE from the invoke.
However, a couple of other patterns can also be observed:

P2. 2020-04-24 11:51:01

```json
{
"alive": true,
"name": "RfcLibError",
"code": 1,
"codeString": "RFC_COMMUNICATION_FAILURE",
"key": "RFC_COMMUNICATION_FAILURE",
"message": {
"message(split at line breaks)": [
"LOCATION CPIC (TCP/IP) on local host with Unicode",
"ERROR program state check for conversation 33641697",
"TIME Fri Apr 24 13:35:57 2020",
"RELEASE 753",
"COMPONENT CPIC (TCP/IP) with Unicode",
"VERSION 3",
"RC 471",
"MODULE D:/depot/bas/753_REL/src/krn/si/cpic/r3cpic.c",
"LINE 6378",
"DETAIL called function STSEND in state state=BUFFER_DATA2",
"COUNTER 8"
]
}
}
```

P3. 2020-04-24 13:00:37
Pattern 3 repeats at 2020-04-24 13:30:27 (id 16), 2020-04-24 13:30:28 (id 17), 2020-04-24 13:33:03 (id 19), and 2020-04-24 13:33:05 (id 20), all with the same connectionHandle. Then we get another instance of pattern 2 (2020-04-24 13:33:12, with identical details). Back to pattern 3 for a while, then another pattern 2, this time with 3 ids that share the same connectionHandle. One instance of pattern 3 and then we're back to pattern 1 again.

Today's log has the same 3 failure patterns. Pattern 3 means that an INVALID_HANDLE error occurs for several client ids sharing the same connectionHandle; the frequent reuse of the same connectionHandle is striking in itself.

I cannot see any obvious culprit in node-rfc (I had several theories, none of which survived further inspection ;-), so either there is a non-obvious issue (perhaps via node-api, or your suspected interaction of Promise implementations) or the issue may be in the SDK. This server is built via CI and does not yet include your Promise-avoidance commit.
Hi again, I've finally managed to produce a tiny example capable of reproducing the issue without involving our project code. It is not completely automated, as the issue is timing-related and - so far - needs a human touch to trigger. The example is attached (tried with node v14.1.0 on Windows 10, using node-rfc). It seems important not only that the RFC call returns a non-trivial result, but also that the JS thread is blocked for a while logging said result while new HTTP requests keep coming in.
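The example itself is not preserved in this thread, so the following is only a rough, hedged reconstruction of the setup described above: an HTTP server (express, on port 3456 as used in the curl loop further down) whose handler makes an RFC call and then blocks the JS thread by logging the whole result while further requests arrive. The route, the RFC function name, and the parameter module are placeholders:

```js
// Rough, hedged reconstruction of the reproduction idea (not the original
// attachment): block the JS thread logging a non-trivial RFC result while
// new HTTP requests keep coming in.
const express = require("express");
const { Pool } = require("node-rfc");

// "./sapConnectionParameters" is a placeholder module with the connection data.
const pool = new Pool(require("./sapConnectionParameters"), { min: 1 });
const app = express();

app.get("/doc/:func", async (req, res) => {
  const client = await pool.acquire();
  // "RFC_FUNCTION_DESC" is a placeholder for a call with a non-trivial result.
  client.invoke("RFC_FUNCTION_DESC", {}, (err, result) => {
    pool.release(client);
    if (err) return res.status(500).json(err);
    // Synchronously logging a large result keeps the JS thread busy for a
    // while, which is the timing ingredient mentioned above.
    console.log(JSON.stringify(result, null, 2));
    res.json({ ok: true });
  });
});

app.listen(3456);
```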
Hi @bsrdjan
Hi @clausreinke, and sorry for the long wait. I just did a brief test and can investigate in detail next week. Thank you very much for this example; it helps to reproduce the issue on Linux. I did not try on Windows but expect the same result. I slightly modified the test setup as follows:
```sh
$ RFC_INI=./sapnwrfc.ini node index.js > log.txt &
$ for i in {1..10}; do curl localhost:3456/doc/function; done
```

Please find the logs with RFC_INVALID_HANDLE attached (logs.zip); I need to investigate them in detail.
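For completeness, here is a hedged sketch of the kind of sapnwrfc.ini the RFC_INI variable above points to; the destination name and all values are placeholders, and the TRACE level is an assumption:

```ini
DEST=TEST_DEST
ASHOST=sap.example.com
SYSNR=00
CLIENT=100
USER=DEMO_USER
PASSWD=secret
TRACE=3
```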
Hi again; I had another look through the node-rfc sources. So far, I've only found one potential issue along those lines. Perhaps you've covered it already, but as we're stuck, I thought I'd mention it. Background: according to the programming guide, the SAP NW RFC library is thread-safe, except for communication and table handles.
Yay!-) Following the theory from #147, I've instrumented the connection close calls. In many cases, at least, this does indeed seem to be the cause of the trouble: a client that previously owned a handle closes it again when it has already been reopened by another client - that other client then fails because the handle has disappeared. I suggest keeping track of whether a client has lost control of its handle, and never closing a handle the client no longer owns; the latter situation will in particular arise in the client destructor.
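At the JavaScript level, the same idea can only be approximated: once an error may have invalidated the connection, do not hand the client back to the pool at all. The sketch below is hedged - the direct client.close() call and the discard-on-error policy are assumptions about a workaround, not the fix that later landed in the C++ layer:

```js
// Hedged sketch of "don't touch a handle you may no longer own": after an
// error that can invalidate the connection, discard the client instead of
// releasing it back into the pool for reuse.
function invokeAndDiscardOnError(pool, client, rfcName, params) {
  return new Promise((resolve, reject) => {
    client.invoke(rfcName, params, (err, result) => {
      if (err) {
        // Communication or ABAP runtime failures may already have closed the
        // handle underneath us; assume the worst and do not reuse this client.
        client.close(() => {}); // assumed API: best-effort close, errors ignored
        return reject(err);
      }
      pool.release(client);
      resolve(result);
    });
  });
}
```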
Hi @clausreinke, thank you for the thoughtful analysis and the patience with this issue. Based on your example, a test setup with express is now in the repository; it follows your example with a few additions.
The test with, let's say, 20 requests creates 5 RFC trace files, indicating 5 threads consuming the Pool. The current implementation of the Pool can't handle that safely because there is no lock/mutex to synchronise taking a connection from the pool and giving it back. That must be protected.

Inspecting the RFC traces, I saw connection handles occurring again after the connection was closed and learned that handles are not necessarily unique during the session. The handle is just a pointer to a C++ object, and after a free/delete the C runtime can re-use the same address in a subsequent malloc/new. It happens very often (especially on Windows) that an ever smarter memory management system "notices" that a block of memory is just the right size and re-uses that block instead of allocating a new one, avoiding unnecessary fragmentation.

The unsynchronised consumption of connections normally fails in the following way: thread A opens connection x and does something with it. The call ends in SysFailure/CommFailure, so that the connection breaks (and is deleted again). The application does not pay attention to the return code, tries to continue, and uses the actually broken connection. This normally results in INVALID_HANDLE.

In the current Pool implementation, the connection is closed when released back to the Pool, as an attempt to prevent reusing a possibly broken connection. But even that is not safe without synchronised threads. The close in release shall definitely be removed, and connections should be closed in release only if the total > Pool.max. The traces of the current Pool implementation show this handle reuse happening. The lower the frequency with which connections are opened and closed again, the greater the statistical probability that this happens.

I would therefore move the Pool to C and synchronise it with the existing Client mutex in the next couple of days, and re-test :)
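For illustration only, here is a minimal JavaScript-level sketch of the kind of synchronisation described above: all acquire and release operations are funnelled through one promise-chain queue, so two interleaved requests cannot race on taking and returning the same connection. SimpleMutex and withClient are hypothetical helpers, not part of node-rfc, and the actual fix is the C-level synchronisation mentioned above:

```js
// Hedged sketch: serialise pool.acquire()/pool.release() with a promise chain.
class SimpleMutex {
  constructor() {
    this._tail = Promise.resolve();
  }
  // Run fn exclusively; callers queue up on the promise chain.
  run(fn) {
    const result = this._tail.then(() => fn());
    this._tail = result.catch(() => {}); // keep the chain alive on errors
    return result;
  }
}

const poolMutex = new SimpleMutex();

// Take a connection, do the work, and give the connection back - with the
// take/give-back steps protected by the mutex.
async function withClient(pool, work) {
  const client = await poolMutex.run(() => pool.acquire());
  try {
    return await work(client);
  } finally {
    await poolMutex.run(() => pool.release(client));
  }
}
```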
I'm not sure I understand you correctly; in particular, how would moving the connection pool from js to c help with the multiple threads? There should be only a single thread evaluating js callbacks; the other threads should come from the libuv thread pool, accessed via napi async workers, and according to the docs only the Execute part of an async worker runs on such a worker thread.

The client semaphore only works for the connection as long as the latter is open - once the connection has been closed, be it by the client or by the backend or by the rfc library, the client must not use it again, semaphore or not; in particular, the close in the client destructor (while necessary) must be guarded, not by whether the connection is currently valid, but by whether it has already been closed since this client opened it. Even if a client successfully closes its connection, it must not try to do so again in its destructor, because another client (using a different semaphore) may have opened the same connection in the meantime.
It took more time, but here is the re-designed version: the v2-pre branch. Please feel free to test, review, and share feedback. The most important changes are described in the changelog and in the documentation: direct-and-managed-clients.
Version 2.0 is published and should fix the issue.
Just checking the status here. Any updates in the meantime, any issues left?
Let's close this one and open a new issue if needed.
Can someone provide me with a mirror link for nwrfc750P?
There are no mirrors; the SAP Support Portal is the only distribution channel for the SAP NW RFC SDK: https://support.sap.com/en/product/connectors/nwrfcsdk.html
As much as we like the library, we're having a persistent problem in using it, namely RFC_INVALID_HANDLE errors. We cannot reliably reproduce the error and, as of yet, no pattern is apparent in what triggers the bug, but it happens more than once per day in light usage.

Lacking a reproducible example, we're looking for anything useful we could log to help narrow down the problem when it occurs (suggestions are welcome - currently, we log client.id, client.isAlive and client.connectionInfo), and we've found a first pattern. It does not cover all cases, but sufficiently many to be interesting: it appears that sometimes the connectionInfo goes missing in acquired connections (logged between const client = await connectionPool.acquire() and client.invoke(..)).

We have a global connection pool, constructed once, via new Pool(sapConnectionParameters, { min: 1, max: -42 }); (the minimum of 1 avoids obscuring the issue with multiple connections), and neither the pool nor the sapConnectionParameters are changed after construction (a shallow copy is made of the latter to ensure this). For every SAP call, we acquire a connection/client, check that it and its id are defined, then call client.invoke(..) and - in its callback - we end with connectionPool.release(client) (in both success and failure cases).

Sometimes, between const client = await connectionPool.acquire() and client.invoke(..), the client.connectionInfo logs as {}, followed by an RFC_INVALID_HANDLE error from the .invoke. We're stumped as to how the connectionInfo could be empty at that point - does this give you any hint about what could be going wrong (perhaps a rogue deconstruction)?

You mentioned in #99 (comment) that threads might be involved. We don't use any threads ourselves and, since nodejs is single-threaded for the JavaScript part, the only threads in play should be the worker threads (Napi::AsyncWorker) used in the node-rfc implementation. Of course, multiple async calls can be interleaved.

OS: Windows 10
node: node-v12.14.1-win-x64
SAP NW RFC SDK: nwrfc750P_6-70002755
node-rfc: [email protected]
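For concreteness, the per-call flow described in this report looks roughly like the following hedged sketch; Pool, acquire, invoke, release, id, isAlive and connectionInfo are the node-rfc 1.x APIs named above, while callSap, the function name and its parameters are placeholders:

```js
// Hedged sketch of the usage pattern described in this report:
// one global pool, then acquire -> check -> log -> invoke -> release per call.
const { Pool } = require("node-rfc");

const sapConnectionParameters = { /* placeholders, see report */ };
const connectionPool = new Pool(sapConnectionParameters, { min: 1 });

async function callSap(rfcName, params) {
  const client = await connectionPool.acquire();
  if (!client || client.id === undefined) {
    throw new Error("acquired client is not usable");
  }
  // The report logs these three values here; an empty connectionInfo at this
  // point is what precedes the RFC_INVALID_HANDLE from the invoke.
  console.log(client.id, client.isAlive, JSON.stringify(client.connectionInfo));
  return new Promise((resolve, reject) => {
    client.invoke(rfcName, params, (err, result) => {
      connectionPool.release(client); // released in both success and failure
      return err ? reject(err) : resolve(result);
    });
  });
}
```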