Bridge continuously sending transactions with 'eth_sendTransaction timed out' message during stress testing #33

akolotov · 2018-03-15T22:16:37Z

This issue is the same as paritytech/parity-bridge#149, but the behavior is even worse because of the automatic bridge restart implemented in the POA bridge.

Network setup:

Home: a PoA testnet (Sokol)
Foreign: Ropsten

A dedicated Python script successfully sent 1200 deposit transactions to the HomeBridge contract. It took 8 blocks to validate all the transactions:
https://sokol-explorer.poa.network/block/1418808
...
https://sokol-explorer.poa.network/block/1418815

The bridge discovered part of these transactions, tried to relay some of them, lost the connection and restarted:

INFO:bridge::bridge::deposit_relay: got 7 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 7 deposits
INFO:bridge::bridge::deposit_relay: deposit relay completed
INFO:bridge::bridge::deposit_relay: got 129 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 129 deposits
INFO:bridge::bridge::deposit_relay: deposit relay completed
INFO:bridge::bridge::withdraw_relay: got 0 new signed withdraws to relay
INFO:bridge::bridge::withdraw_relay: fetching messages and signatures
INFO:bridge::bridge::withdraw_relay: fetching messages and signatures complete
INFO:bridge::bridge::withdraw_relay: relaying 0 withdraws
INFO:bridge::bridge::withdraw_relay: relaying withdraws complete
INFO:bridge::bridge::withdraw_relay: waiting for signed withdraws to relay
INFO:bridge::bridge::withdraw_confirm: got 0 new withdraws to sign
INFO:bridge::bridge::withdraw_confirm: signing
INFO:bridge::bridge::withdraw_confirm: signing complete
INFO:bridge::bridge::withdraw_confirm: submitting 0 signatures
INFO:bridge::bridge::withdraw_confirm: submitting signatures complete
INFO:bridge::bridge::withdraw_confirm: waiting for new withdraws that should get signed
INFO:bridge::bridge::deposit_relay: got 113 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 113 deposits
INFO:bridge::bridge::deposit_relay: deposit relay completed
INFO:bridge::bridge::deposit_relay: got 710 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 710 deposits
WARN:bridge: Bridge is down with Request eth_sendTransaction timed out, attempting to restart
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0xc47e8427c9eda913b9749bf0904cd8765b39f5448368194cb7aa4330bbe6a44d"))])
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0x4d96f7f39c50dd7b19aac9af708bca450fb6c9fb6b72024c7736d9883614845d"))])
...
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0x6b16c0ef9e34d65c4431c4e1e8c043564d53cf3ce440d7ea706164838904e209"))])

The database file was not updated, so restarting the bridge thread led to the same error:

INFO:bridge::bridge::deposit_relay: got 951 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 951 deposits
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0x9b034a82ebf6306933a35848f9d7b1552ababb8be5774e5501e4802013911a01"))])
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0x606b57af0b13dbeddf6b4c16833b0365dace777cc4feee82d2de14c5dde53c45"))])

So the bridge keeps sending transactions indefinitely.

Even after the bridge process is killed manually, the database has to be modified by hand. That in turn locks funds on the HomeBridge contract side, because the ForeignBridge contract has transferred only part of the tokens.
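The restart loop described above can be illustrated with a toy model. This is only a sketch under assumptions: the `simulate` function, the `checkpoint` variable and the sequential sending are hypothetical simplifications, not the bridge's actual code. The point is that when the checkpoint only advances after a whole batch completes, every restart resends deposits the node has already accepted.

```python
def simulate(deposits, timeout_at, attempts):
    """Toy model of relay-with-restart (hypothetical, not the bridge code).

    The checkpoint plays the role of the database file: it only moves
    forward when the whole batch was relayed without a timeout.
    """
    checkpoint = 0
    sent = []  # everything the node actually received
    for _ in range(attempts):
        try:
            for i, deposit in enumerate(deposits[checkpoint:]):
                sent.append(deposit)  # the node accepts the transaction
                if i >= timeout_at:
                    # the bridge gives up before it sees the reply
                    raise TimeoutError("eth_sendTransaction timed out")
            checkpoint = len(deposits)  # reached only without a timeout
            break
        except TimeoutError:
            continue  # restart from the unchanged checkpoint
    return sent, checkpoint
```

With four deposits, a timeout on the third send and three attempts, the node receives nine transactions while the checkpoint stays at 0.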


akolotov · 2018-03-15T22:33:17Z

The first transaction that arrived on Ropsten: https://ropsten.etherscan.io/tx/0x42d204dbf87b8b53c00a1ce11c79ed6eee91e189ebb1106211db7f24ba20f4b6.

The last transaction: https://ropsten.etherscan.io/tx/0xb53ea50e62bfcfda6ba1441d243a4df56f472b2e7d9b895943dbd8dcb8257d47.

In total, the bridge sent about 17000 transactions before it was killed.

yrashk · 2018-03-16T05:58:46Z

On it

akolotov · 2018-03-16T06:10:53Z

config.toml - https://gist.github.com/akolotov/5c68b9438c991401df12450ad569a4f2

yrashk · 2018-03-16T06:12:04Z

Thank you. Do you have a script that reproduces this issue?

akolotov · 2018-03-16T06:13:34Z

https://github.com/poanetwork/parity-bridge-research/blob/master/erc20/bridge/contracts/home_batch_deposit.py

akolotov · 2018-03-16T06:24:49Z

I think the root cause of the issue could be related to the hardware of my testbed: it is an ordinary laptop with an Intel i5-3317U CPU at 1.70GHz, 4 cores, an HDD (not an SSD) and a WiFi connection to the Internet.
It is not my main workstation; it is dedicated to tests only: CentOS Linux without X.org installed, parity and the bridge, with no other processes running.

yrashk · 2018-03-17T04:22:37Z

So, from the log above we can see that Parity is actually sending a response to those timed-out requests (the bridge doesn't lose the connection, it simply abandons it after a timeout is experienced). The way the bridge is structured, it effectively treats timed-out transactions as if they "never happened".
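A small illustration of that behavior, using Python threads rather than the bridge's Rust code: the caller stops waiting after a timeout, but the request keeps running and the "node" still processes it. The names `demo` and `slow_node_send`, and the delay values, are assumptions made for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def demo(node_delay=0.2, client_timeout=0.02):
    accepted = []  # transactions the hypothetical node ended up processing

    def slow_node_send(tx):
        time.sleep(node_delay)  # the node is busy, so the reply comes late
        accepted.append(tx)
        return "0xhash-" + tx

    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_node_send, "tx1")
        try:
            future.result(timeout=client_timeout)
            timed_out = False
        except FutureTimeout:
            timed_out = True  # the caller gives up; the request is still in flight
    # Leaving the with-block waits for the worker thread to finish.
    return timed_out, accepted
```

Here the caller observes a timeout even though the transaction was accepted, which is the mismatch that makes a blind retry produce duplicates.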

There are perhaps a couple of measures we can take here:

1. Increase the timeout length. This does not require changing bridge code, just the config. In your example config, it is set to 10 seconds. What's the longest reasonable timeout you can think of?
2. Persist every outgoing transaction in a durable queue and always flush the queue before starting normal operations.
3. Limit the number of transactions that can be sent out simultaneously, to relieve the pressure on Parity and prevent the timeout counter from being started too early (the timeout is per-transaction).
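As a rough sketch of the durable-queue measure from the list above: entries are written to disk before sending and removed only after the send returns, so a restart can flush whatever is left before normal operation resumes. The `DurableQueue` class, the JSON-lines file format and the `send` callback are all hypothetical; nothing here reflects how the bridge actually stores its state.

```python
import json
import os


class DurableQueue:
    """File-backed queue of outgoing transactions (hypothetical format)."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]

    def _store(self, items):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            for item in items:
                f.write(json.dumps(item) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make sure the data reaches the disk
        os.replace(tmp, self.path)  # atomically swap in the new file

    def enqueue(self, tx):
        self._store(self._load() + [tx])

    def flush(self, send):
        """Send queued entries in order; drop each only after send() returns."""
        pending = self._load()
        while pending:
            send(pending[0])  # may raise; the entry then stays queued
            pending = pending[1:]
            self._store(pending)
```

On startup the bridge would call `flush` before picking up new deposits. Note that this sketch alone does not make a resend after a timeout harmless; that would need an additional mechanism the thread does not specify.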

yrashk · 2018-03-17T07:54:35Z

I think the most efficient first step on my end here would be (3). After that, we can do (2).

akolotov · 2018-03-17T18:16:34Z

I completely agree with your thoughts. Option (1) is too platform-specific and does not provide any guarantee that the system will not, at some point, get into a state where the timeout is too short again.

When too many transactions are being sent out, the response from the node comes after the operation has timed out. This is particularly noticeable on heavy loads on slower computers.

Solution: chunk transactions into batches

By default, the size of the batch is 2. However, it is important to note that since there's no coordination between different parties, there might be more than one batch at a time (but they should be within a single digit, since the number of operations performed by the bridge is limited).

Addresses omni#33
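The batching idea in this commit message can be sketched as follows. Only the default batch size of 2 comes from the message; the `chunks` and `relay_in_batches` helpers and the `send_batch` callback are hypothetical names, and the real change is written in Rust and may be structured differently.

```python
def chunks(items, size=2):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def relay_in_batches(deposits, send_batch, batch_size=2):
    """Send one batch at a time.

    The next batch starts only after send_batch() has returned for the
    previous one, so at most `batch_size` requests are pending per relay.
    """
    results = []
    for batch in chunks(deposits, batch_size):
        results.extend(send_batch(batch))
    return results
```

Because each relay waits for its current batch, the number of transactions in flight is bounded by the batch size times the number of relays running at once, rather than by the number of deposits found.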

yrashk · 2018-03-17T19:56:08Z

Would you mind trying yrashk@304a843 out on your hardware setup? This is a first draft. This change limits the size of the batch. The integration tests pass. Let me know if this helps or not.

akolotov · 2018-03-20T09:51:00Z

I have tested the changes with 1K and 2K deposits (transactions).
1K deposits transferred successfully.
The original issue was reproduced with 2K deposits.

Here is the difference in the bridge logs:
1K deposits:

INFO:bridge::bridge::deposit_relay: got 353 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 353 deposits
INFO:bridge::bridge::deposit_relay: deposit relay completed
INFO:bridge::bridge::deposit_relay: got 300 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 300 deposits
INFO:bridge::bridge::deposit_relay: deposit relay completed

2K deposits:

INFO:bridge::bridge::deposit_relay: got 710 new deposits to relay
INFO:bridge::bridge::deposit_relay: relaying 710 deposits
WARN:bridge: Bridge is down with Request eth_sendTransaction timed out, attempting to restart
WARN:<unknown>: Sending a response to deallocated channel: Ok([Ok(String("0x6c881ae94549f05a09088e1ed017324eb4909e9957d61f5ad9a4967a2ac04ad3"))])

So, the bridge combined two sequential blocks in one batch for the second test.

This means it makes sense to understand why Parity behaves differently in these two cases. Most probably that will point us to a proper fix for the issue.

yrashk · 2018-05-29T19:22:53Z

Do we still experience this issue in any severe form?

akolotov · 2018-05-30T14:12:49Z

I have not tested it with the new version of the bridge that supports RPC. Are you able to generate traffic (a dozen transactions in one block) and test it yourself?

akolotov added bug rust critical to do labels Mar 15, 2018

yrashk mentioned this issue Mar 24, 2018

Problem: sending too many transactions #35

Closed
