RFC_INVALID_HANDLE #128

clausreinke · 2020-03-16T17:30:19Z

As much as we like the library, we're having a persistent problem in using it, namely

{
  "name": "RfcLibError",
  "code": 13,
  "key": "RFC_INVALID_HANDLE",
  "message": "An invalid handle was passed to the API call"   
}

We cannot reliably reproduce the error and as of yet, no pattern is apparent in what triggers the bug, but it happens more than once per day in light usage.

Lacking a reproducable example, we're looking for anything useful we could log to help narrow down the problem when it occurs (suggestions are welcome - currently, we log client.id, client.isAlive and client.connectionInfo) and we've found a first pattern. It does not cover all cases but sufficiently many to be interesting:

It appears that sometimes the connectionInfo goes missing in acquired connections (logged between const client = await connectionPool.acquire() and client.invoke(..)).

We have a global connection pool, constructed once, via new Pool(sapConnectionParameters, { min: 1, max: -42 }); (the minimum of 1 avoids obscuring the issue with multiple connections) and neither the pool nor the sapConnectionParameters are changed after construction (a shallow copy is made of the latter to ensure this). For every SAP call, we acquire a connection/client, check that it and its id are defined, then call client.invoke(..) and -in its callback- we end with connectionPool.release(client) (in both success and failure cases).

Sometimes, between const client = await connectionPool.acquire() and client.invoke(..), the client.connectionInfo logs as {}, followed by an RFC_INVALID_HANDLE error from the .invoke.

We're stumped as to how the connectionInfo could be empty at that point - does this give you any hint about what could be going wrong (perhaps a rogue deconstruction)?

You mentioned in #99 (comment) that

RFC_INVALID_HANDLE is usually caused by sharing the same node-rfc client connection instance among more than one thread.

We don't use any threads ourselves and since nodejs is single-threaded for the JavaScript part, the only threads in play should be the worker threads (Napi::AsyncWorker) used in the node-rfc implementation. Of course, multiple async calls can be interleaved.

OS: Windows 10
node: node-v12.14.1-win-x64
SAP NW RFC SDK: nwrfc750P_6-70002755
node-rfc: [email protected]

The text was updated successfully, but these errors were encountered:

bsrdjan · 2020-03-19T13:29:08Z

I am aware of this longer standing issue, hardly reproducible in a controlled manner.

Finally found a simple way to reproduce (unit tests in locktest.js) but can't be sure exactly the same cause is happening in your environment.

Reading the code, the empty connectionInfo should never be possible, need to investigate that further.

Can you build from source the current master and test for some time if any improvements in your environment? I can also build a Windows binary or publish a (pre)release with this fix, if needed?

clausreinke · 2020-03-20T16:11:00Z

I've built from source (npm install, then npm prebuild) and will test next week.

From looking at the commit, the casts should already be implicit in mallocU/memsetU, so it is about an additional error check (always useful) and a lock in PrepareAsync, right?

I don't understand either how the connectionInfo gets emptied, since it is read directly from the sap nwrfc library and it looks like there should at least be empty fields..

Btw, the connectionInfo structure looks a bit different in nwrfc750P_6-70002755, _RFC_ATTRIBUTES in sapnwrfc.h: compare fields user, cpicConvId, progName, and partnerBytesPerChar. There are also some new fields.

bsrdjan · 2020-03-20T17:15:40Z

Beside casts, additional exclusive locks are added in Ping() and PrepareAsync() (see Lock/UnlockMutex())

With just pushed commit c5a443b, another exclusive lock is added in ConnectionInfo(), to prevent another client possibly interfere after connection handle check, invalidating connection handle before reading connectionInfo. But even in such case, an error should be returned, not empty connectionInfo.

It is hard to believe but SAP NWRFC SDK might also have a bug, will check with SDK team :)

clausreinke · 2020-03-25T15:40:20Z

)-: just had another one, this time with node-rfc e69b5b9

same basic effect: connection with empty connectionInfo straight out of connection pool acquire.

clausreinke · 2020-03-25T16:32:01Z

not sure whether this will help, but since we lack information: immediately before the bogus acquire result, there were two near simultaneous calls.

We log client.id, client.isAlive and client.connectionInfo, immediately after acquire, immediately before invoke, and immediately after release (both success and error case)

16:32:51 post acquire ----> 64 true connectionInfo
16:32:52 pre invoke ----> 64 true connectionInfo
16:32:52 post acquire ----> 65 true connectionInfo
16:32:52 pre invoke ----> 65 true connectionInfo
16:32:52 post release (success) ----> 65 false {}
16:32:53 post release (success) ----> 64 false {}

16:32:53 post acquire ----> 66 true {}
16:32:53 pre invoke ----> 66 true {}
16:32:53 post release (error) ----> 66 false {}
16:32:53 communication/network problem {
  "name": "RfcLibError",
  "code": 13,
  "key": "RFC_INVALID_HANDLE",
  "message": "An invalid handle was passed to the API call"
}

Note that client.connectionInfo always reports as empty immediately after release. Then it reports as empty immediately after acquire and the error is raised in invoke, leading to release and error log in the error case.

connectionInfo usually includes

{
  ..,
  "codepage": "4103",
  "partnerCodepage": "4103",
  "rfcRole": "C",
  "type": "E",
  "partnerType": "3",
  "rel": "753",
  "partnerRel": "740",
  "kernelRel": "749",
  "cpicConvId": "57521748",
  "progName": "SAPLSYST",
  "partnerBytesPerChar": "2",
  "partnerSystemCodepage": "4103",
  ..
}

connection pool options are { "min": 1, "max": -42 }

bsrdjan · 2020-04-02T14:38:28Z

After discussing with NWRFC SDK experts, two changes are implemented with f058683

The IsConnectionHandleValid check may return true, even after connection closed, if the connection handle in internal NWRFC SDK cache still looks valid. That check is in ConnectionInfo therefore replaced by simple connection alive flag check.

More restrictive logic is added to reset this flag, after certain error types raised on ABAP server side.

Could you please build from from master with this commit included and re-test?

The NWRFC SDK trace would be also helpful for analysis. The trace can be activated by setting the connection parameter trace to 3. If the resulting trace file is not too big and contains no confidential data, please attach, or we find another way to share.

clausreinke · 2020-04-03T13:59:06Z

Thanks - I've pulled, installed, prebuilt, and I'm going to test next week. Not sure about the trace yet, as it may run for a day between errors (for such cases, it would be useful if the SDK had a "rotating trace" option that only keeps the last hour of trace output - that could be copied once an error occured; for call-level tracing, we had to add our own trace variant anyway to get a useful amount of trace data).

Btw, while the previous fixes did not subjectively change the number of INVALID_HANDLEs, they did change their symptoms: all occurrences this week were of the kind "empty connectionInfo after acquire" (previously, there were other variations mixed in). So you might well be on the right track.

clausreinke · 2020-04-06T10:07:15Z

sadly, the INVALID_HANDLE reappeared quickly, though apparently as a side-effect of an "Unknown error in lower level library" (the latter had previously happened at a rate of about 1/week, at least once we were past the buggy SDK patch levels).

So the symptoms have changed, and the trace happened to be on at the time:

logging (note the interleaving and apparent interference of client ids 30 and 31, as well as the error info in place of connectionInfo):

18:16:52 post acquire----> 30 true {
  "dest": "",
  "host": "..",
  "partnerHost": "..",
  "sysNumber": "00",
  "sysId": "..",
  "client": "..",
  "user": "..",
  "language": "D",
  "trace": "3",
  "isoLanguage": "DE",
  "codepage": "4103",
  "partnerCodepage": "4103",
  "rfcRole": "C",
  "type": "E",
  "partnerType": "3",
  "rel": "753",
  "partnerRel": "740",
  "kernelRel": "749",
  "cpicConvId": "23016689",
  "progName": "SAPLSYST",
  "partnerBytesPerChar": "2",
  "partnerSystemCodepage": "4103",
  "partnerIP": "..",
  "partnerIPv6": ".."
}
18:16:52 .. called // invoke with client.id 30 happens here
// here we acquire client.id 31, claiming to be alive but with an error instead of connectionInfo
18:16:53 post acquire----> 31 true {
  "name": "RfcLibError",
  "code": 13,
  "key": "RFC_INVALID_HANDLE",
  "message": "An invalid handle was passed to the API call"
}
// here comes the callback for the failed invoke with client.id 30
18:16:53 post release (error) ----> 30 false {}
18:16:53 {
  "name": "RfcLibError",
  "code": 28,
  "key": "",
  "message": "Unknown error in lower level library"
}

trace (three seem to have been active at the time in question, one for each connection and one for admin/cleanup?) - note: both RfcOpenConnection return RFC_OK (in spite of the RfcLibError in connectionInfo above) and there is only one call to RfcGetConnectionAttributes, not two

rfc35832_30048.trc - client.id 30

2020-04-04 18:16:52.630909 [30048] >> RfcOpenConnection
>> RfcOptions
	Partner Char Size: 1
	Client: ..
	User: ..
	Alias: 
	Passwd: (- secret -)
	Language: D
	MySAPSSO2: 
	GetSSO2: 0
	X.509 Certificate: 
	ExtidData: 
	ExtidType: 
	SysNr: 00
	ASHost: ..
	ASService: 
	GWHost: 
	GWService: sapgw00
	R3Name: 
	MSHost: 
	MSService: 
	WSHost: 
	WSPort: 
	TCPHost: 
	TCPPort: 
	TLS Cryptolib: 
	TLS Client PSE: 
	TLS Server PSE: 
	TLS Trust All: Off
	Proxy Host: 
	Proxy Port: 
	Proxy User: 
	ProgramID: 
	SAPRouter: 
	SNC Mode: 0
	SNC QOP: 0
	SNC Lib: 
	SNC Myname: 
	SNC Partnername: 
	Codepage: 1100
	Trace: 3
	LogonCheck: 1
	Compression: 1
	Serialization: 0
	Ignore Char Conv Errors: 0
	Substitution Char: 35
	UseRoundtripOptimization: 0
	SO Keepalive: 0
	CPIC Keepalive: 0
	Delta Manager: 1
<< RfcOptions

2020-04-04 18:16:52.631194 [30048] 	>> RfcOptions::determineType
determined type RFC_CLIENT
2020-04-04 18:16:52.631276 [30048] 	<< RfcOptions::determineType returned 0

2020-04-04 18:16:52.631325 [30048] 	>> RfcConnection::openNew
2020-04-04 18:16:52.631381 [30048] 	>> RfcConnection::connect
2020-04-04 18:16:52.631426 [30048] 	>> CpicConnection::connectAsClient
2020-04-04 18:16:52.637434 [30048] 	<< CpicConnection::connectAsClient returned 0

>> HEXDUMP
2020-04-04 18:16:52.637600 [30048] Sending data for handle 2138511726400, convID 23016689, codepage 1100
..
<< HEXDUMP

>> HEXDUMP
2020-04-04 18:16:52.642371 [30048] Receiving data for handle 2138511726400, convID 23016689, codepage unknown
..
<< HEXDUMP

RfcConnection handle 2138511726400 established
2020-04-04 18:16:52.642790 [30048] 	<< RfcConnection::connect returned 0

>> RfcConnection name="none"
Handle = 2138511726400 ConvID = 23016689
>> RfcControl name=".."
	Program Name: SAPLSYST                                
	Codepage: 4103		Communication Codepage: 1100
	SysID: ..		Target: ..		Mandant: ..		SysNr: 00		Language: D		User: ..
	Conversation ID: 23016689	Role: C753
	Partner Rel: 740 		Partner Kernel Rel: 749 		Partner IP: ..		Partner IPv6: ..		Protocol version: 3595
	Exclusive: false		Get SSO2: false		Use basXML: 0
<< RfcControl

>> CpicConnection
	CPIC Handle: 0
	Gateway Host: 		Gateway Service: 		AS Host: 		AS Service: 		Program ID: 
	Protocol: I
	Conversation ID: 23016689
	Last IO Time: Sat Apr 04 18:16:52 2020

	Data Received: 3		Status Received: 2		All Received: 1
	Connected: true		Deallocated: false		Accepted: 0
	Server Mode: false		Registered Mode: false		Just Accepted: 0
<< CpicConnection

	Asynchron: 0
	Has Error: 0
<< RfcConnection

2020-04-04 18:16:52.643036 [30048] 	<< RfcConnection::openNew returned 0

2020-04-04 18:16:52.643065 [30048] << RfcOpenConnection returned RFC_OK

2020-04-04 18:16:52.762744 [30048] >> RfcInvoke
Invoking .. on handle 2138511726400

>> RfcFunction name=".."
..
<< RfcFunction

2020-04-04 18:16:52.764140 [30048] 	>> RfcFunction::rfcCallReceive
>> HEXDUMP
2020-04-04 18:16:52.765782 [30048] Sending data for handle 2138511726400, convID 23016689, codepage 4103
..
<< HEXDUMP

2020-04-04 18:16:53.291129 [30048] >> Exception occurred
	[Handle: 2138511726400	ConvID: 0	SysID: ..	SysNr: 00	Target: ..	User: ..	Mandant: ..	Language: D	Codepage: 4103]
	Unknown error during RFC call

2020-04-04 18:16:53.291211 [30048] 	<< RfcFunction::rfcCallReceive returned 15

2020-04-04 18:16:53.294320 [30048] << RfcInvoke returned RFC_UNKNOWN_ERROR
	RFC_ERROR_INFO.message: Unknown error in lower level library

rfc35832_30680.trc - client.id 31

2020-04-04 18:16:53.249346 [30680] >> RfcOpenConnection
>> RfcOptions
	Partner Char Size: 1
	Client: ..
	User: ..
	Alias: 
	Passwd: (- secret -)
	Language: D
	MySAPSSO2: 
	GetSSO2: 0
	X.509 Certificate: 
	ExtidData: 
	ExtidType: 
	SysNr: 00
	ASHost: ..
	ASService: 
	GWHost: 
	GWService: sapgw00
	R3Name: 
	MSHost: 
	MSService: 
	WSHost: 
	WSPort: 
	TCPHost: 
	TCPPort: 
	TLS Cryptolib: 
	TLS Client PSE: 
	TLS Server PSE: 
	TLS Trust All: Off
	Proxy Host: 
	Proxy Port: 
	Proxy User: 
	ProgramID: 
	SAPRouter: 
	SNC Mode: 0
	SNC QOP: 0
	SNC Lib: 
	SNC Myname: 
	SNC Partnername: 
	Codepage: 1100
	Trace: 3
	LogonCheck: 1
	Compression: 1
	Serialization: 0
	Ignore Char Conv Errors: 0
	Substitution Char: 35
	UseRoundtripOptimization: 0
	SO Keepalive: 0
	CPIC Keepalive: 0
	Delta Manager: 1
<< RfcOptions

2020-04-04 18:16:53.249535 [30680] 	>> RfcOptions::determineType
determined type RFC_CLIENT
2020-04-04 18:16:53.249584 [30680] 	<< RfcOptions::determineType returned 0

2020-04-04 18:16:53.249613 [30680] 	>> RfcConnection::openNew
2020-04-04 18:16:53.249667 [30680] 	>> RfcConnection::connect
2020-04-04 18:16:53.249695 [30680] 	>> CpicConnection::connectAsClient
2020-04-04 18:16:53.251574 [30680] 	<< CpicConnection::connectAsClient returned 0

>> HEXDUMP
2020-04-04 18:16:53.251665 [30680] Sending data for handle 2138513959152, convID 23018303, codepage 1100
..
<< HEXDUMP

>> HEXDUMP
2020-04-04 18:16:53.256384 [30680] Receiving data for handle 2138513959152, convID 23018303, codepage unknown
..
<< HEXDUMP

RfcConnection handle 2138513959152 established
2020-04-04 18:16:53.257160 [30680] 	<< RfcConnection::connect returned 0

>> RfcConnection name="none"
Handle = 2138513959152 ConvID = 23018303
>> RfcControl name=".."
	Program Name: SAPLSYST                                
	Codepage: 4103		Communication Codepage: 1100
	SysID: ..		Target: ..		Mandant: ..		SysNr: 00		Language: D		User: ..
	Conversation ID: 23018303	Role: C753
	Partner Rel: 740 		Partner Kernel Rel: 749 		Partner IP: ..		Partner IPv6: ..		Protocol version: 3595
	Exclusive: false		Get SSO2: false		Use basXML: 0
<< RfcControl

>> CpicConnection
	CPIC Handle: 0
	Gateway Host: 		Gateway Service: 		AS Host: 		AS Service: 		Program ID: 
	Protocol: I
	Conversation ID: 23018303
	Last IO Time: Sat Apr 04 18:16:53 2020

	Data Received: 3		Status Received: 2		All Received: 1
	Connected: true		Deallocated: false		Accepted: 0
	Server Mode: false		Registered Mode: false		Just Accepted: 0
<< CpicConnection

	Asynchron: 0
	Has Error: 0
<< RfcConnection

2020-04-04 18:16:53.257545 [30680] 	<< RfcConnection::openNew returned 0

2020-04-04 18:16:53.257588 [30680] << RfcOpenConnection returned RFC_OK

rfc35832_12896.trc - admin/cleanup?

2020-04-04 18:16:52.643281 [12896] >> RfcGetConnectionAttributes
2020-04-04 18:16:52.643328 [12896] << RfcGetConnectionAttributes returned RFC_OK

2020-04-04 18:16:53.271555 [12896] >> RfcCloseConnection
2020-04-04 18:16:53.271647 [12896] 	>> RfcConnection::closeHandle
>> RfcConnection name="none"
Handle = 2138511726400 ConvID = 23016689
>> RfcControl name=".."
	Program Name: SAPLSYST                                
	Codepage: 4103		Communication Codepage: 1100
	SysID: ..		Target: ..		Mandant: ..		SysNr: 00		Language: D		User: ..
	Conversation ID: 23016689	Role: C753
	Partner Rel: 740 		Partner Kernel Rel: 749 		Partner IP: ..		Partner IPv6: ..		Protocol version: 3595
	Exclusive: false		Get SSO2: false		Use basXML: 0
<< RfcControl

>> CpicConnection
	CPIC Handle: 0
	Gateway Host: 		Gateway Service: 		AS Host: 		AS Service: 		Program ID: 
	Protocol: I
	Conversation ID: 23016689
	Last IO Time: Sat Apr 04 18:16:52 2020

	Data Received: 0		Status Received: 0		All Received: 1
	Connected: true		Deallocated: false		Accepted: 0
	Server Mode: false		Registered Mode: false		Just Accepted: 0
<< CpicConnection

	Asynchron: 0
	Has Error: 0
<< RfcConnection

2020-04-04 18:16:53.272157 [12896] 	>> RfcConnection::disconnect
2020-04-04 18:16:53.272204 [12896] 	>> CpicConnection::close
2020-04-04 18:16:53.283041 [12896] 	<< CpicConnection::close returned 0

2020-04-04 18:16:53.283110 [12896] 	<< RfcConnection::disconnect returned 0

2020-04-04 18:16:53.283157 [12896] 	<< RfcConnection::closeHandle returned 0

2020-04-04 18:16:53.283184 [12896] << RfcCloseConnection returned RFC_OK

2020-04-04 18:16:53.283285 [12896] >> RfcCloseConnection
2020-04-04 18:16:53.283310 [12896] 	>> RfcConnection::closeHandle
>> RfcConnection name="none"
Handle = 2138513959152 ConvID = 23018303
>> RfcControl name=".."
	Program Name: SAPLSYST                                
	Codepage: 4103		Communication Codepage: 1100
	SysID: ..		Target: ..		Mandant: ..		SysNr: 00		Language: D		User: ..
	Conversation ID: 23018303	Role: C753
	Partner Rel: 740 		Partner Kernel Rel: 749 		Partner IP: ..		Partner IPv6: ..		Protocol version: 3595
	Exclusive: false		Get SSO2: false		Use basXML: 0
<< RfcControl

>> CpicConnection
	CPIC Handle: 0
	Gateway Host: 		Gateway Service: 		AS Host: 		AS Service: 		Program ID: 
	Protocol: I
	Conversation ID: 23018303
	Last IO Time: Sat Apr 04 18:16:53 2020

	Data Received: 3		Status Received: 2		All Received: 1
	Connected: true		Deallocated: false		Accepted: 0
	Server Mode: false		Registered Mode: false		Just Accepted: 0
<< CpicConnection

	Asynchron: 0
	Has Error: 0
<< RfcConnection

2020-04-04 18:16:53.283547 [12896] 	>> RfcConnection::disconnect
2020-04-04 18:16:53.283572 [12896] 	>> CpicConnection::close
2020-04-04 18:16:53.293954 [12896] 	<< CpicConnection::close returned 0

2020-04-04 18:16:53.294004 [12896] 	<< RfcConnection::disconnect returned 0

2020-04-04 18:16:53.294040 [12896] 	<< RfcConnection::closeHandle returned 0

2020-04-04 18:16:53.294069 [12896] << RfcCloseConnection returned RFC_OK

2020-04-04 18:16:53.301973 [12896] >> RfcDestroyFunction
Deleting function ..
2020-04-04 18:16:53.302058 [12896] << RfcDestroyFunction returned RFC_OK

bsrdjan · 2020-04-13T17:34:34Z

Checked together with NWRFC SDK experts and RFC_UNKNOWN_ERROR, coming from levels below the SDK, makes it hard to guess what could be wrong here.

I found two critical node-rfc gaps which might have impact here:

Closing the connection possible while RFC call ongoing
Pool reopen() method, used when releasing the connection back to Pool, resolves to void/undefined. It should be the Client instance, just like the Pool open() method

These gaps, leading to issues like #132 and #133, are now fixed in master.

Client locking is also checked in detail and more unit tests added for Pool.

All these fixes will be shipped with next release, as soon the #122 fixed, waiting for prebuild/prebuild#267.

If waiting for prebuild issue takes longer, please feel free to build from current master: 5837df5

These new added features might be helpful in troubleshooting and custom logging:

Number of RFC calls currently running via the client instance: https://github.com/SAP/node-rfc/blob/master/test/locking/callback.spec.js#L41
Throughput of one or more client instances: https://github.com/SAP/node-rfc/blob/master/test/throughput.spec.js
Client instance created, last open/call/close timestamps: https://github.com/SAP/node-rfc/blob/master/test/throughput.spec.js

Hope this helps :)

bsrdjan · 2020-04-14T07:17:06Z

Changes are published in v1.1.0

clausreinke · 2020-04-14T10:18:08Z

Thanks, it is good to see you're active on these issues (and in touch with the SDK folks) and that there is visible progress. Though the "closing while active" issue should not affect us.

I did not have as much opportunity to test last week as before and did not keep the trace on all the time but -subjectively- the number of incidents seems to have gone down (from at least once per day to less than once per day, over a week). Will update and check whether the additional info helps with logging.

And having the existing fixes published prompts me to update our other servers (I kept them running on 1.0.3 for comparison, and they're still having more frequent issues, including random shutdowns).

The thing that confused me about the last logs I provided was that one call failed and another, interleaved, call got the INVALID_HANDLE - was that a coincidence of two calls failing one after the other with different errors or was there some interaction between one call failing and the other ones error message?

Btw, when browsing the trace files, I see that the "Sending data" entries vary between code pages 4103 and 1100 while the "Receiving data" entries vary between 4103 and unknown - is that normal (I would expect sending/receiving to stick to one known code page each)?

clausreinke · 2020-04-14T13:39:53Z

I'm currently a bit confused about the extra outputs, in particular .status timestamps and ._connectionHandles. I am also starting to wonder where the client .id comes from and how it is expected to behave (as it is used to index the pool).

First, I noticed that some freshly acquired Clients had lastcall stamps that were smaller than their lastopened stamps, then I followed one ._connectionHandle and saw it reappear with different client ids And the client ids are not monotonically increasing, either.

Slice of log for specific ._connectionHandle and the .ids associated with it (in order, each line has client .id, ._connectionHandle, .isAlive, .runningRFCCalls, .status ):

12:54:18  post acquire           ----> 3 (1506563273056) true 0 {"created":1586861658889,"lastopen":1586861658889,"lastclose":0,"lastcall":0} 
12:54:20  post release (success) ----> 3 (undefined) false 0 {"created":1586861658889,"lastopen":1586861658889,"lastclose":1586861660586,"lastcall":1586861660114} 
14:22:48  post acquire           ----> 2 (1506563273056) true 0 {"created":1586861658666,"lastopen":1586866961499,"lastclose":0,"lastcall":1586866961390} 
14:22:48  post release (success) ----> 2 (undefined) false 0 {"created":1586861658666,"lastopen":1586866968976,"lastclose":0,"lastcall":1586866968887} 
14:22:49  post acquire           ----> 2 (1506563273056) true 0 {"created":1586861658666,"lastopen":1586866968976,"lastclose":0,"lastcall":1586866968887} 
14:22:49  post release (success) ----> 2 (undefined) false 0 {"created":1586861658666,"lastopen":1586866969261,"lastclose":0,"lastcall":1586866969166} 
14:22:49  post acquire           ----> 2 (1506563273056) true 0 {"created":1586861658666,"lastopen":1586866969261,"lastclose":0,"lastcall":1586866969166} 
14:22:49  post release (success) ----> 2 (undefined) false 0 {"created":1586861658666,"lastopen":1586866969861,"lastclose":0,"lastcall":1586866969830} 
14:22:50  post acquire           ----> 2 (1506563273056) true 0 {"created":1586861658666,"lastopen":1586866969861,"lastclose":0,"lastcall":1586866969830} 
14:23:00  post release (success) ----> 2 (undefined) false 0 {"created":1586861658666,"lastopen":1586866969861,"lastclose":1586866980632,"lastcall":1586866970450}

clausreinke · 2020-04-15T10:08:20Z

Here is an INVALID_HANDLE with the latest versions (again, trace wasn't on, but some of the new logging options were)

versions:
{"major":7500,"minor":0,"patchLevel":6,"binding":"1.1.0"}

connectionInfo:
{
  "dest": "",
  "host": "-",
  "partnerHost": "-",
  "sysNumber": "00",
  "sysId": "-",
  "client": "-",
  "user": "-",
  "language": "D",
  "trace": "0",
  "isoLanguage": "DE",
  "codepage": "4103",
  "partnerCodepage": "4103",
  "rfcRole": "C",
  "type": "E",
  "partnerType": "3",
  "rel": "753",
  "partnerRel": "740",
  "kernelRel": "749",
  "cpicConvId": "89864650",
  "progName": "SAPLSYST",
  "partnerBytesPerChar": "2",
  "partnerSystemCodepage": "4103",
  "partnerIP": "-",
  "partnerIPv6": "-"
}

given that previous traces showed code pages 4103 and 1100, is it odd that there is only 4103 here?

As the connectionHandle sees some reuse, here is a log slice wrt that handle, ending in an INVALID_HANDLE directly on acquire

most of the time, the handle remains defined and the client valid even after release
except for the last acquire, the same handle ist acquired with different ids
except for the last acquire, the timestamps change even if the handle does not

18:31:23 connection, post acquire----> 3 (3069229760224) true 0 {"created":1586881883129,"lastopen":1586881883129,"lastclose":0,"lastcall":0} {connectionInfo}
18:31:23 RFC F1 called
18:31:25 RFC F1 (after 2.821s) returns
18:31:26 connection, post release (success) ----> 3 (3069229760224) true 0 {"created":1586881883129,"lastopen":1586881883129,"lastclose":1586881886000,"lastcall":1586881883176} {}

19:12:24 connection, post acquire----> 4 (3069229760224) true 0 {"created":1586884344466,"lastopen":1586884344466,"lastclose":0,"lastcall":0} {connectionInfo}
19:12:24 RFC F1 called
19:12:26 RFC F1 (after 2.269s) returns
19:12:26 connection, post release (success) ----> 4 (undefined) false 0 {"created":1586884344466,"lastopen":1586884344466,"lastclose":1586884346760,"lastcall":1586884344489} {}

19:12:29 connection, post acquire----> 5 (3069229760224) true 0 {"created":1586884349515,"lastopen":1586884349516,"lastclose":0,"lastcall":0} {connectionInfo}
19:12:29 RFC F2 called
19:12:30 RFC F2 (after 1.058s) returns
19:12:30 connection, post release (success) ----> 5 (3069229760224) true 0 {"created":1586884349515,"lastopen":1586884350590,"lastclose":0,"lastcall":1586884349531} {}

19:12:30 connection, post acquire----> 5 (3069229760224) true 0 {"created":1586884349515,"lastopen":1586884350590,"lastclose":0,"lastcall":1586884349531} {
  "name": "RfcLibError",
  "code": 13,
  "key": "RFC_INVALID_HANDLE",
  "message": "An invalid handle 'RFC_CONNECTION_HANDLE' was passed to the API call"
}

bsrdjan · 2020-04-15T13:01:32Z

Thanks a lot for the very valuable input.

Btw, when browsing the trace files, I see that the "Sending data" entries vary between code pages 4103 and 1100 while the "Receiving data" entries vary between 4103 and unknown - is that normal (I would expect sending/receiving to stick to one known code page each)?

Yes, that’s normal. No need to worry, the codepages are handled automatically between NWRFC SDK and RFC server system.

I am also starting to wonder where the client .id comes from and how it is expected to behave (as it is used to index the pool).

The client id comes from N-API/C wrapper and incremented with each new Client instance. After opening the connection, the client gets the connection handle. The handle can change after re-opening the connection, id stays the same. The reopen happens after the client released back to pool and acquired for new task, for example.

The thing that confused me about the last logs I provided was that one call failed and another, interleaved, call got the INVALID_HANDLE - was that a coincidence of two calls failing one after the other with different errors or was there some interaction between one call failing and the other ones error message?

This could be the actually the cause of getting RFC_INVALID_HANDLE. After certain errors, the connection is automatically closed and cleaned-up by NWRFC SDK. The node-rfc was so far in this regard transparent, only updating isAlive flag. Checking this flag before each call in NodeJS is not enough because one client can start multiple client.call() calls in parallel and all would see alive === true. After one call fails with critical error, the connection is broken, not anymore useful for other pending calls. When the next call comes for execution, the RFC_INVALID_HANDLE is raised.

With the latest commit 5794944, the node-rfc automatically re-opens the connection, after critical errors. This should eliminate or at least considerably reduce RFC_INVALID_HANDLE errors.

If the problem persists, I would try to reproduce by using the same stack (any other components, frameworks on top of node-rfc?) and calling RFC_PING_AND_WAIT and RFC_RAISE_ERROR ABAP modules, to simulate some real ABAP processing.

clausreinke · 2020-04-16T11:40:10Z

After a full day of testing 1.1.0 on two servers, that version definitely generates a lot more RFC_INVALID_HANDLEs, but almost all of them come immediately after the acquire, the rest from reused handles. So that would support your theory. Will update to the current HEAD and test more.

I do find it odd that acquire would simply return something that looks like a client connection but is not valid. Reopening connections that are likely invalid should avoid this, but shouldn't acquire have returned or thrown a clear error condition instead of returning an invalid handle with an error hidden in the connectionInfo?

clausreinke · 2020-04-16T14:38:10Z

Still getting a lot of RFC_INVALID_HANDLEs with e7c1fba , mostly after acquire:-(

clausreinke · 2020-04-16T14:41:42Z

Could there be an option for not reusing connections (ie not putting the reopened client back into the pool after release but just closing it), so that acquire always returns fresh clients?

bsrdjan · 2020-04-16T14:54:02Z

The reason for empty connectionInfo is still open and discussed with NWRFC SDK experts but hard to guess without reproducible example.
Which components/frameworks are used on servers, on top of Pool? How the typical application looks like?
I would try to make more stress tests, with similar setup.

bsrdjan · 2020-04-17T06:01:46Z

You may test also the Pool without reuse of released connections 9833196.

In parallel I will check if Pool functionality could be moved to C, making it less dependable on Promises implementation.

bsrdjan · 2020-04-17T12:06:58Z

I observed some unit test behave differently when running from the command line and via npm script, executing exactly the same command line. Sometimes, via npm script, the tests were failing because the client.id could not be referenced because the client is undefined. Isolated tests and all tests via command line were always passing and failing only when running all tests via npm script.

Suspecting it could be somehow related to Promise used in Pool internal active/ready collections, I replaced them with Client instances, reducing the overhead.

The npm testing issues are now gone and perhaps it could improve the situation in your environment: a6ec7ef

If not, the Pool shall be moved to C level, becoming independent of Promise implementations on different platforms.

clausreinke · 2020-04-22T20:12:36Z

I'm still observing acquire returning a client where getting the connectionInfo will return an INVALID_HANDLE error and where the _connectionHandle has been seen before in the same server run (even though lastcall is 0 and the id is different):

21:10:36 connection, post acquire ----> 54 (2783525369200) true 0 {"created":1587582636762,"lastopen":1587582636762,"lastclose":0,"lastcall":0} {"major":7500,"minor":0,"patchLevel":6,"binding":"1.2.0"} {
  "alive": true,
  "name": "RfcLibError",
  "code": 13,
  "codeString": "RFC_INVALID_HANDLE",
  "key": "RFC_INVALID_HANDLE",
  "message": "An invalid handle 'RFC_CONNECTION_HANDLE' was passed to the API call"
}

As before, the INVALID_HANDLE in getting the connectionInfo predicts/precedes the same error on the actual invoke that follows.

clausreinke · 2020-04-27T16:11:47Z

I've collected 2 somewhat longer log slices, hoping to see some connections.

As before, the individual calls proceed as follows:

acquire
invoke
then either
- return info + release (success)
- release (error) + error message + return info

acquire and release have an await, invoke a callback (the release calls are in this callback).

post acquire and release, we log client .id , ._connectionHandle, .isAlive, .runningRFCCalls, .status, and .connectionInfo. The client.id is used to track the other events through interleaved calls.

The most common failure pattern is still:

P1.

post acquire, .connectionInfo gives an INVALID_HANDLE error,
the following invoke raises the same error

However, a couple of other patterns can also be observed:

P2. 2020-04-24 11:51:01

post acquire for id 27, we have a client with INVALID_HANDLE error in .connectionInfo
a second acquire for id 28 returns the very same ._connectionHandle (27 has not been released at this point)
the invoke for 28 results in a communication failure

{
  "alive": true,
  "name": "RfcLibError",
  "code": 1,
  "codeString": "RFC_COMMUNICATION_FAILURE",
  "key": "RFC_COMMUNICATION_FAILURE",
  "message": {
	"message(split at line breaks)": [
	  "LOCATION    CPIC (TCP/IP) on local host with Unicode",
	  "ERROR       program state check for conversation 33641697",
	  "TIME        Fri Apr 24 13:35:57 2020",
	  "RELEASE     753",
	  "COMPONENT   CPIC (TCP/IP) with Unicode",
	  "VERSION     3",
	  "RC          471",
	  "MODULE      D:/depot/bas/753_REL/src/krn/si/cpic/r3cpic.c",
	  "LINE        6378",
	  "DETAIL      called function STSEND in state state=BUFFER_DATA2",
	  "COUNTER     8"
	]
  }
}

P3. 2020-04-24 13:00:37

post acquire for id 154, we have a client with INVALID_HANDLE error in .connectionInfo
however, the following invoke for 154 succeeds!

Pattern 3 repeats 2020-04-24 13:30:27, id 16, and 2020-04-24 13:30:28, id 17, 2020-04-24 13:33:03, id 19, 2020-04-24 13:33:05, id 20, all with the same ._connectionHandle.

Then, we get another instance of pattern 2 (2020-04-24 13:33:12, identical ._connectionHandle concurrently used for 2 client.ids, 22 and 23, leading to RFC_COMMUNICATION_FAILURE.

Back to pattern 3 for a while, then another pattern 2, this time with 3 ids that share the same ._connectionHandle (2020-04-24 13:35:54, ids 32, 33, and 34). The acquire for 34 does not come with an INVALID_HANDLE error but the invoke for 34 results in RFC_COMMUNICATION_FAILURE.

One instance of pattern 3 and then we're back to pattern 1 again.

Todays log has the same 3 failure patterns. Pattern 3 means that an INVALID_HANDLE error for .connectionInfo after acquire is not a certain indicator of invoke failure, but currently, I'm more curious how several consecutive and even concurrent acquires can end up with the same ._connectionHandle, with differing .connectionInfos.

The frequent reuse of ._connectionHandles suggests that the .status might better be associated with handles, not with clients.

I cannot see any obvious culprit in node-rfc (I had several theories, none of which survived further inspection;-), so either there is a non-obvious issue (perhaps via node-api, or your suspected interaction of Promise implementations) or the issue may be in the SDK. This server is build via CI and does not yet include your Promise-avoidance commit but when I log Promise, I get [native code].
logslice-2020-04-24.txt
logslice-2020-04-27.txt

clausreinke · 2020-05-05T17:02:54Z

Hi again,

I've finally managed to produce a tiny example capable of reproducing the issue without involving our project code. It is not completely automated, as the issue is timing-related and -so far- needs a human touch to trigger.

invalid-handle.zip

The example (index.js + package.json) is a small express server with only a single route, which -via node-rfc- calls a standard ABAP function FUNCTION_IMPORT_DOKU to describe itself

unzip, npm install
add a ./sapnwrfc.ini with DEST=DEV
start the server: RFC_INI=./sapnwrfc.ini node index.js
visit the route in your browser: http://localhost:3456/doc/function
the non-trivial result is returned to the HTTP client and logged to console
now repeatedly reload, not neccessarily giving the server time to complete each request
after a few quick (<20, depending on your timing) repeats, the INVALID_HANDLE appears

(tried with node v14.1.0 on windows 10, using [email protected])

It seems important not only that the RFC call returns a non-trivial result, but that the JS thread is blocked for a while logging said result while new HTTP requests keep coming in.

clausreinke · 2020-05-11T08:44:01Z

Hi @bsrdjan
It has been quiet here these last 3 weeks - have you been able to reproduce the issue using the mini-project I provided last week? Any ideas/suggestions on how to proceed?

bsrdjan · 2020-05-12T11:22:06Z

Hi @clausreinke and sorry for the longer waiting. I just did a brief test and can investigate in detail next week. Thank you very much for this example, it helps reproducing on Linux. Did not try on Windows but expect the same result. I slightly modified the test setup as follows:

start the server in background mode, redirecting the output to log file:

$ RFC_INI=./sapnwrfc.ini node index.js > log.txt &

repeat reload by:

$ for i in {1..10}; do curl localhost:3456/doc/function; done

Please find logs with RFC_INVALID_HANDLE attached, need to investigate in detail: logs.zip

clausreinke · 2020-05-27T10:14:50Z

Hi again; I had another look through the node-rfc code and could still not find any obvious trouble spots. So I was wondering whether the code might tell a higher-level story that somehow could break down at a lower level (so we'd only think everything is fine).

So far, I've only found one potential issue along those lines. Perhaps you've covered it already but as we're stuck, I thought I'd mention it.

Background: according to the programming guide, the Sap NW RFC library is thread-safe, but for communication and table handles. node-rfc covers table handles by copying the data and communication handles by using a semaphore invocationMutex.

potential issue: each invocationMutex is associated with a node-rfc Client, not with the underlying communication handle (should there be a map from handles to locks instead?)
- now, if the communication handle is closed behind the scenes, because of an error, it could be reused for another client, resulting in two clients, two semaphores, two threads and one handle
- for RfcInvoke, the code already protects against this case by opening an new connection if the old one was closed due to error
- the API docs only mention the closing of handles in error case for RfcInvoke but the error codes are identical for other API-calls that involve handles - do they need the same protection? For instance, what happens if RfcGetFunctionDesc runs into an error that closes the connection handle?
question: client.connectionInfo calls RfcGetConnectionAttributes without locking; it is read-only, so this seems reasonable but could it be possible for other threads to change the connection handle while RfcGetConnectionAttributes is running? Could that cause issues?

clausreinke · 2020-05-28T08:47:09Z

Yay!-)

Following the theory from #147, I've instrumented node-rfc to log, in the Client deconstructor, which client id closes which handle; I've also activated the LOCK logging, add client id and handle there, too; then I've re-run the reproduction example:

In many cases, at least, this does indeed seem to be the cause of trouble, because a client that previously owned a handle closes it again when it has already been reopened by another client - that other client then fails because the handle has disappeared.

I suggest to keep track of whether a client has lost control of its handle

by closing it explicitly
by running into an error that closed it implicitly

As the latter will happen in Execute, this is probably where the checks will have to be (don't call anything if handle control has been lost, after call, check whether handle control has been lost due to error). If neccessary/helpful, OnOK can reestablish control by opening a new handle. And the deconstructor must not try to close a handle it has lost control of.

bsrdjan · 2020-05-28T15:17:24Z

Hi @clausreinke,

thank you for the thoughtful analysis and the patience with this issue.

Based on you example, the test setup with express is in invalid_handle folder and the issue is reproducible with run.sh script.

The test is based on your example, with following added:

The node-rfc dev version used, for faster testing
The client http requests simulated by curl/wget/...
The client http request number can be optionally passed to express and used in tracing of Pool states during requests.

The test with let say 20 requests, creates 5 RFC trace files, indicating 5 threads consuming the Pool. The current implementation of the Pool can't handle that safely because there is no lock/mutex to synchronise taking one connection from pool and giving it back. That must be protected.

Inspecting RFC traces, I saw connection handles occurring after closing the connection and learned that handles are not necessarily unique during the session.

The handle is just the pointer to the C ++ object, and after a free / delete the C-Runtime can re-use the same address again in a subsequent malloc / new. It happens very often (especially on Windows), that ever smarter memory management system “notices” the block of memory is just the right size and the block is re-used, instead of allocating a new one, causing unnecessary fragmentation.

The synchronised consumption of connections fails normally in a following way. The Thread A opens connection x and does something with it. The call ends in SysFailure / CommFailure, so that the connection breaks (and is deleted again). The application does not pay attention to the return code and tries to continue and use actually broken connection. This normally results in INVALID_HANDLE.

In current Pool implementation, the connection is closed when released back to Pool, as attempt to prevent reusing eventually broken connection. But even that is not possible with synchronised threads. The close in release shall be definitely removed and connections in release closed only if the total > Pool.max.

The traces of the current Pool implementation show something like:
Before Thread A "uses" handle x, Thread B comes along and also opens a connection.
By chance, the C-Runtime also takes the address x again. So now the deal is "valid" again.
Now thread A continues, just using the connection that thread B just opened for itself ...
(It took a while until debugged and understood ...)

The lower the frequency with which connections are opened and closed again, the greater is statistical probability that this will happen.

I would therefore move the Pool to C, synchronise with existing Client mutex, in next couple of days and re-test :)

clausreinke · 2020-06-02T01:10:57Z

I'm not sure I understand you correctly, in particular, how moving the connection pool from js to c would help with the multiple threads. There should be only a single thread evaluating js callbacks, the other threads should be from a libuv thread pool, accessed via napi async workers; according to docs, only the Execute method actually runs outside the main thread (main thread being the event loop running js code), on one of the libuv worker threads.

The client semaphore only works for the connection as long as the latter is open - once the connection has been closed, be it by the client or by the backend or by the rfc library, the client must not use it again, semaphore or not; in particular, the close in the client destructor (while necessary) must be guarded, not by whether the connection is currently valid, but by whether it has already been closed since this client opened it.

Even if a client successfully closes its connection, it must not try to do so again in its destructor, because another client (using a different semaphore) may have opened the same connection in the meantime.

bsrdjan · 2020-07-15T17:04:12Z

It took more time and here the re-designed version: v2-pre branch.

Please feel free to test, review and share the feedback. The most important changes are in the changelog and in documentation: direct-and-managed-clients

bsrdjan · 2020-07-20T10:08:58Z

The version 2.0 published, should fix the issue.

bsrdjan · 2020-08-25T09:19:13Z

Just checking the status here. Any updates in the meantime, any issues left ?

bsrdjan · 2020-09-03T11:30:24Z

Let close this one and open new issue of needed.

olsonperrensen · 2022-06-17T12:57:42Z

Can someone provide me a mirror link for nwrfc750P ?

bsrdjan · 2022-06-17T13:02:05Z

Can someone provide me a mirror link for nwrfc750P ?

There are no mirrors, SAP Support Portal is the only distribution channel for SAP NW RFC SDK

https://support.sap.com/en/product/connectors/nwrfcsdk.html

Information on where to download the 7.50 version of the SAP NetWeaver RFC SDK and information about currently supported platforms can be found in SAP Note 2573790.

bsrdjan added a commit that referenced this issue Mar 19, 2020

Concurrent locking fix for invoke() and ping(), related to #128

9cebb9a

bsrdjan added a commit that referenced this issue Mar 20, 2020

Concurrent locking fix for ConnectionInfo(), related to #128

c5a443b

bsrdjan added a commit that referenced this issue Mar 20, 2020

ConnectionInfo update for 750 PL6, related to #128

e55783f

bsrdjan added a commit that referenced this issue Apr 2, 2020

Connection close after fatal errors, related to #128

8e2b13b

bsrdjan added a commit that referenced this issue Apr 2, 2020

Connection close after fatal errors, related to #128

11bc2a3

clausreinke mentioned this issue Apr 16, 2020

stack traces for potential segfaults? #138

Closed

clausreinke mentioned this issue May 28, 2020

RFC_COMMUNICATION_FAILURE #147

Closed

clausreinke mentioned this issue Jun 6, 2020

Track handle status #149

Merged

bsrdjan closed this as completed Sep 3, 2020

igorpocta mentioned this issue Jan 12, 2021

RFC_UNKNOWN_ERROR 28 - Unknown error in lower level library SAP-archive/PyRFC#215

Closed

RFC_INVALID_HANDLE #128

RFC_INVALID_HANDLE #128

Comments

clausreinke commented Mar 16, 2020

bsrdjan commented Mar 19, 2020

clausreinke commented Mar 20, 2020

bsrdjan commented Mar 20, 2020

clausreinke commented Mar 25, 2020

clausreinke commented Mar 25, 2020

bsrdjan commented Apr 2, 2020 • edited Loading

clausreinke commented Apr 3, 2020

clausreinke commented Apr 6, 2020

bsrdjan commented Apr 13, 2020

bsrdjan commented Apr 14, 2020

clausreinke commented Apr 14, 2020

clausreinke commented Apr 14, 2020

clausreinke commented Apr 15, 2020

bsrdjan commented Apr 15, 2020

clausreinke commented Apr 16, 2020

clausreinke commented Apr 16, 2020

clausreinke commented Apr 16, 2020

bsrdjan commented Apr 16, 2020

bsrdjan commented Apr 17, 2020

bsrdjan commented Apr 17, 2020

clausreinke commented Apr 22, 2020

clausreinke commented Apr 27, 2020 • edited Loading

clausreinke commented May 5, 2020

clausreinke commented May 11, 2020

bsrdjan commented May 12, 2020 • edited Loading

clausreinke commented May 27, 2020

clausreinke commented May 28, 2020 • edited Loading

bsrdjan commented May 28, 2020

clausreinke commented Jun 2, 2020 • edited Loading

bsrdjan commented Jul 15, 2020

bsrdjan commented Jul 20, 2020

bsrdjan commented Aug 25, 2020

bsrdjan commented Sep 3, 2020

olsonperrensen commented Jun 17, 2022

bsrdjan commented Jun 17, 2022

bsrdjan commented Apr 2, 2020 •

edited

Loading

clausreinke commented Apr 27, 2020 •

edited

Loading

bsrdjan commented May 12, 2020 •

edited

Loading

clausreinke commented May 28, 2020 •

edited

Loading

clausreinke commented Jun 2, 2020 •

edited

Loading