This repository has been archived by the owner on Jul 14, 2023. It is now read-only.

add_entity has periodic long waits that raise an Exception when they exceed 15 minutes #45

Open
IntoCpp opened this issue Feb 20, 2020 · 2 comments

Comments


IntoCpp commented Feb 20, 2020

Using: azure-cosmosdb-table==1.0.6

Problem description: Keeping an instance of TableService alive for a long time (hours) to perform regular “add_entity” calls results in rare but recurring long transaction times: up to 15+ minutes. When the 15-minute mark is exceeded on a transaction, an exception of type “EntityAlreadyExists” is raised. Querying the database then shows the data was saved at the time of the request (15 minutes earlier).

Expected behavior:

  • An instance of TableService should work without problems for many days (even months) and perform any number of transactions.
  • If a problem occurs that makes a transaction unusually long, feedback should be reported in a timely manner so the user can take the appropriate action for the context (wait, continue, abandon, etc.).

Problem details:
Context: every 30 seconds I gather data and send it to an Azure CosmosDb table using TableService via add_entity. Here is some pseudo code of such an operation:

import time

from azure.cosmosdb.table import TableService

# my_account_name, my_sas_token and table_name are placeholders; run_for_48_hours()
# and get_latest_data() stand in for the real helpers.
table_service = TableService(account_name=my_account_name, sas_token=my_sas_token)
while run_for_48_hours():
    test_data = get_latest_data()
    table_service.insert_entity(table_name, test_data)
    time.sleep(30)
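
For context on the exception discussed below: insert_entity returns a 409 Conflict (“EntityAlreadyExists”) when an entity with the same PartitionKey/RowKey already exists, which is why a client-side retry of an insert that already committed surfaces as AzureConflictHttpError. A minimal illustration, with made-up key values and the same placeholders as above:

from azure.common import AzureConflictHttpError
from azure.cosmosdb.table import TableService

entity = {"PartitionKey": "azure_stress_test", "RowKey": "demo-row", "value": 1}
table_service = TableService(account_name=my_account_name, sas_token=my_sas_token)

table_service.insert_entity(table_name, entity)      # first insert succeeds
try:
    table_service.insert_entity(table_name, entity)  # same keys -> 409 Conflict
except AzureConflictHttpError:
    pass  # "EntityAlreadyExists", the same error seen after the long wait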

A code sample and a stack trace from when the problem occurs are provided further below (see the attachment).

Problem observation:
The call to add_entity usually returns in less than 0.2 seconds.
Every once in a while the call takes anywhere from several seconds to 15 minutes to return, but recovers without any reported error and no raised exception. This sometimes falls in the 2-to-30-second range, but most of the time it is between 100 and 960 seconds.
Then at some point, after running for 2 to 16 hours, the call exceeds 15 minutes and an “EntityAlreadyExists” exception is raised. In fact, for the first several days of tracking this issue, the exception always occurred at exactly 962 seconds; this cannot be a coincidence. (Since then I have witnessed longer times, but in a test environment with less tracing enabled.)

Weird: when the exception occurs, checking the database shows the last data entry with a timestamp at the exact second the function was called (about 15 minutes before the exception was thrown). Was the transaction successful but never reported, and then retried by the client side for the last 15 minutes? In that case the EntityAlreadyExists exception would be a valid reported error. But it is very strange that this only occurs at the 962-second mark and never when a slow operation stays under that time.

Work-around: Create a new instance of the TableService each time we write to the database. This works but should not be necessary.
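
For reference, a minimal sketch of this work-around (the account, token and table names are the same placeholders as in the pseudo code above):

from azure.cosmosdb.table import TableService

def insert_with_fresh_client(account_name, sas_token, table_name, entity):
    # Work-around: build a throw-away TableService for every write so no
    # connection is kept alive between iterations.
    table_service = TableService(account_name=account_name, sas_token=sas_token)
    return table_service.insert_entity(table_name, entity)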

Additional comments:

  • The documentation states a configurable server-side timeout (defaulting to 30 seconds) for all TableService operations. Changing the value to 5 or 10 seconds did not change the observed issue. I assume that since the transaction was a success (the data was found in the database) we do not fall into the server time-out code.
  • Waiting 15 minutes to report a problem is very long for what is usually a sub-0.2-second transaction.
  • I’m no HTTP specialist, but I am surprised that any HTTP-based transaction would give the observed behavior. Maybe a hint to the problem?
  • A resource management (leak) problem is suspected, since this issue only occurs on “long-lived” (hours) instances of TableService.
  • Ideally a solution would permit the user to give an (optional) time-out value so any transaction delay is reported a lot faster; we could then take proper action, be it moving on to the next iteration or verifying whether the entity was saved and retrying. A sketch of this idea follows the list.
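
Along the lines of the last point, here is a minimal sketch of a client-side deadline using only the standard library (insert_entity_with_deadline and its parameters are illustrative, not part of the SDK):

import concurrent.futures

# Hypothetical helper, not part of azure-cosmosdb-table: bound how long the
# caller blocks on insert_entity by running it in a worker thread and giving
# up on waiting after `timeout` seconds.  The underlying request keeps
# running in the background; only the wait is bounded.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def insert_entity_with_deadline(table_service, table_name, entity, timeout=30):
    future = _executor.submit(table_service.insert_entity, table_name, entity)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # Caller decides what to do: skip this iteration, check whether the
        # row landed anyway, or retry with a fresh TableService.
        raise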

Code used to reproduce the problem:
Please find attached the test I used to reproduce the issue, run under pytest: test_stress_azure_tiny.txt

Test results: (latest)
The test was started at 8h17, the last entry in the DB was at 17h02, and the exception stopped the test at 17h36 (34 minutes in limbo). So this time we waited 8+ hours for the problem to occur.

The last entry from that test:
{
    "PartitionKey": "azure_stress_test",
    "RowKey": "998417850225.1086",
    "Timestamp": "2020-02-19T22:03:56.835937+00:00",
    "etag": "W/\"datetime'2020-02-19T22%3A03%3A56.8359377Z'\"",
    "iteration": 1027,
    "last_test_run_time": 0.01710820198059082,
    "test_time_local": "2020-02-19 17:02:53.891263"
}

Pytest failure terminal output:

(log_server) 08:17:45 ~/Documents/work/dev/log_server >pytest test_stress_azure_tiny.py -vv -s

============================================ test session starts =============================================
platform linux -- Python 3.7.5, pytest-5.3.5, py-1.8.1, pluggy-0.13.1 -- /home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/bin/python3.7m
cachedir: .pytest_cache
rootdir: /home/into_azure/Documents/work/dev/log_server, inifile: pytest.ini
collected 1 item

test_stress_azure_tiny.py::test_insert_entity_in_a_long_lived_table_service
Test running for 30m3s
Test running for 1h0m6s
Test running for 1h30m8s
Test running for 2h0m11s
... (removed some)
Test running for 8h0m45s
Test running for 8h30m47s
FAILED

================================================== FAILURES ==================================================
______________________________ test_insert_entity_in_a_long_lived_table_service ______________________________

    def test_insert_entity_in_a_long_lived_table_service():
        """
        In this one we keep alive the same instance of TableService, see if
        the problem is in the internal resource handling.
        """
### Some source code removed, since it was already provided earlier in attachment.

            # ***** THIS IS THE TEST *****
>           etag = table_service.insert_entity(table_name, dummy_data_dict)

test_stress_azure_tiny.py:86:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/azure/cosmosdb/table/tableservice.py:933: in insert_entity
    return self._perform_request(request, _extract_etag)
../../../../.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/azure/cosmosdb/table/tableservice.py:1106: in _perform_request
    return super(TableService, self)._perform_request(request, parser, parser_args, operation_context)
../../../../.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/azure/cosmosdb/table/common/storageclient.py:430: in _perform_request
    raise ex
../../../../.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/azure/cosmosdb/table/common/storageclient.py:358: in _perform_request
    raise ex
../../../../.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/azure/cosmosdb/table/common/storageclient.py:344: in _perform_request
    HTTPError(response.status, response.message, response.headers, response.body))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

http_error = HTTPError('Conflict')

    def _http_error_handler(http_error):
        ''' Simple error handler for azure.'''
        message = str(http_error)
        error_code = None

        if 'x-ms-error-code' in http_error.respheader:
            error_code = http_error.respheader['x-ms-error-code']
            message += ' ErrorCode: ' + error_code

        if http_error.respbody is not None:
            message += '\n' + http_error.respbody.decode('utf-8-sig')

        ex = AzureHttpError(message, http_error.status)
        ex.error_code = error_code

>       raise ex
E       azure.common.AzureConflictHttpError: Conflict
E       {"odata.error":{"code":"EntityAlreadyExists","message":{"lang":"en-US","value":"The specified entity already exists.\nRequestId:3baa0619-3002-0073-5375-e7ca5e000000\nTime:2020-02-19T22:37:29.1599282Z"}}}

../../../../.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/azure/cosmosdb/table/common/_error.py:115: AzureConflictHttpError
--------------------------------------------- Captured log call ----------------------------------------------
ERROR    azure.cosmosdb.table.common.storageclient:storageclient.py:429 Client-Request-ID=940c305a-5363-11ea-a369-c3cf23644b51 Retry policy did not allow for a retry: Server-Timestamp=Wed, 19 Feb 2020 22:37:28 GMT, Server-Request-ID=3baa0619-3002-0073-5375-e7ca5e000000, HTTP status code=409, Exception=Conflict{"odata.error":{"code":"EntityAlreadyExists","message":{"lang":"en-US","value":"The specified entity already exists.\nRequestId:3baa0619-3002-0073-5375-e7ca5e000000\nTime:2020-02-19T22:37:29.1599282Z"}}}.
====================================== 1 failed in 32840.75s (9:07:20) =======================================
(log_server) 17:36:26 ~/Documents/work/dev/log_server >

Stack trace obtained while a transaction is stuck

The test code and console output provided give the stack when the exception occurs. But more interesting is: where is the stack while the TableService transaction appears to be stuck? It is doing “something” for 15+ minutes, but what? To obtain the stack of a stuck transaction, I ran TableService in a separate thread and dumped the stack whenever a call was taking too long to complete. I dumped the stack at regular intervals while researching this and always obtained the same stack.
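
Roughly the kind of watchdog used to capture that stack, as a sketch (the interval and thread set-up are illustrative, not the exact test code):

import faulthandler
import sys
import threading
import time

def dump_stacks_periodically(interval_s=60.0):
    # Print the Python stack of every thread at a fixed interval, so we can
    # see where a long-running insert_entity call is blocked.
    while True:
        time.sleep(interval_s)
        faulthandler.dump_traceback(file=sys.stderr, all_threads=True)

# Run the watchdog as a daemon thread next to the thread doing the inserts.
threading.Thread(target=dump_stacks_periodically, daemon=True).start()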

STACK TRACE of an add_entity transaction pending for several minutes:
(using faulthandler.dump_traceback)

2020-02-12 15:32:40.848047
==> dump_stack was called (#29)
Thread 0x00007f5554daa700 (most recent call first):
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/urllib3/util/wait.py", line 105 in do_poll
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/urllib3/util/wait.py", line 43 in _retry_on_intr
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/urllib3/util/wait.py", line 107 in poll_wait_for_socket
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/urllib3/util/wait.py", line 146 in wait_for_read
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 325 in recv_into
File "/home/into_azure/anaconda3/lib/python3.7/socket.py", line 589 in readinto
File "/home/into_azure/anaconda3/lib/python3.7/http/client.py", line 267 in _read_status
File "/home/into_azure/anaconda3/lib/python3.7/http/client.py", line 306 in begin
File "/home/into_azure/anaconda3/lib/python3.7/http/client.py", line 1344 in getresponse
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/urllib3/connectionpool.py", line 416 in _make_request
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/urllib3/connectionpool.py", line 672 in urlopen
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/requests/adapters.py", line 449 in send
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/requests/sessions.py", line 646 in send
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/requests/sessions.py", line 533 in request
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/azure/cosmosdb/table/common/_http/httpclient.py", line 92 in perform_request
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/azure/cosmosdb/table/common/storageclient.py", line 321 in _perform_request
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/azure/cosmosdb/table/tableservice.py", line 1106 in _perform_request
File "/home/into_azure/.local/share/virtualenvs/log_server-i033OCQD/lib/python3.7/site-packages/azure/cosmosdb/table/tableservice.py", line 933 in insert_entity
File "/home/into_azure/Documents/work/dev/log_server/log.py", line 354 in push_points_in_database
File "/home/into_azure/Documents/work/dev/log_server/log.py", line 305 in log
File "/home/into_azure/Documents/work/dev/log_server/log.py", line 182 in run
File "/home/into_azure/anaconda3/lib/python3.7/threading.py", line 926 in _bootstrap_inner
File "/home/into_azure/anaconda3/lib/python3.7/threading.py", line 890 in _bootstrap

Current thread 0x00007f555bada740 (most recent call first):
File "log_server.py", line 53 in dump_stack
File "log_server.py", line 266 in <module>

IntoCpp commented Feb 25, 2020

The mentioned work-around may be worse. When many threads start working at the same time I get a lot of these exceptions:
HTTPSConnectionPool(host='my_account_name.table.core.windows.net', port=443): Max retries exceeded with url: /Tables('AzureUnitTESTS') (Caused by SSLError(SSLError("bad handshake: SysCallError(104, 'ECONNRESET')")))

Under investigation...


IntoCpp commented Feb 28, 2020

Work-around (this one works): I created a new thread-pool object with a data pool that accumulates transactions to be sent. Other objects (also threads) fill the pool.
The thread-pool regularly (every few seconds) creates a TableService, a new one each time, to send all the accumulated messages.
This compromise between a single long-lived TableService (months) and many short-lived TableService instances (fractions of a second, constantly instantiated and used inside threads) appears to hit the sweet spot and works fine. For now :)
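
A minimal sketch of that pattern, assuming a single background flusher thread (the queue, interval and helper names are illustrative, not the code actually used; my_account_name, my_sas_token and table_name are the placeholders from the earlier pseudo code):

import queue
import threading
import time

from azure.cosmosdb.table import TableService

pending = queue.Queue()  # producer threads call pending.put(entity)

def flush_pending(account_name, sas_token, table_name, interval_s=5.0):
    # Every few seconds, build a fresh TableService and drain everything
    # accumulated since the last pass.
    while True:
        time.sleep(interval_s)
        if pending.empty():
            continue
        table_service = TableService(account_name=account_name,
                                     sas_token=sas_token)
        while not pending.empty():
            table_service.insert_entity(table_name, pending.get())

threading.Thread(
    target=flush_pending,
    args=(my_account_name, my_sas_token, table_name),
    daemon=True,
).start()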
