Using Storage
`Storage` is an interface that abstracts away various filesystems and cloud providers. You give it a provider layer path, and you can then download or upload files relative to that path.

`Storage` provides (Python) multithreading to accelerate uploads and downloads over HTTP/1 connections. You can set the number of threads with `n_threads`. Setting it to 0 runs everything on the main program thread. Using too many threads (somewhere between 64 and 128 on the test machine below) will cause a crash.
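
To illustrate why threads help with network-bound transfers, here is a minimal sketch built on Python's standard `concurrent.futures`. It is not the `Storage` implementation: `fake_download`, `download_all`, `timed`, and the 20 ms latency are hypothetical stand-ins for real network requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(path, latency=0.02):
    """Hypothetical stand-in for one network request: wait, then return bytes."""
    time.sleep(latency)
    return path.encode("utf8")


def download_all(paths, n_threads):
    """Fetch every path. n_threads=0 runs on the calling thread,
    mirroring the convention described above."""
    if n_threads == 0:
        return [fake_download(p) for p in paths]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map returns results in the same order as the inputs
        return list(pool.map(fake_download, paths))


def timed(paths, n_threads):
    """Return (results, elapsed_seconds) for one download_all call."""
    start = time.perf_counter()
    results = download_all(paths, n_threads)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    paths = ["info"] * 20
    for n in (0, 4, 16):
        _, elapsed = timed(paths, n)
        print(n, round(elapsed, 3))
```

Because each simulated request spends its time waiting rather than computing, the waits overlap across threads, which is the same effect the benchmark below measures against real services.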
By default, a `Storage` instance spawns a number of threads. Unfortunately, the `__del__` method of the storage object is called inconsistently (probably because of the timing of garbage collection), so you should clean up explicitly once you are finished with a storage object. There are three ways to do this:
- `with` statement (preferred)

      with Storage(...) as stor:
          stor.put_file(...)

- `storage.kill_threads()`

      storage = Storage(...)
      files = storage.get_files(...)
      storage.kill_threads()

- No threads (no cleanup necessary)

      storage = Storage(..., n_threads=0)
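
As a rough illustration of why the `with` form is the most reliable of the three, the sketch below shows a context manager that shuts down its worker threads on exit, even if the body raises. `WorkerPool` and `submit` are hypothetical names invented for this example; this is not the actual `Storage` class, only one plausible way such cleanup could be wired up.

```python
import queue
import threading


class WorkerPool:
    """Hypothetical thread pool (not the real Storage class)."""

    def __init__(self, n_threads=4):
        self._queue = queue.Queue()
        self._threads = []
        for _ in range(n_threads):
            t = threading.Thread(target=self._work, daemon=True)
            t.start()
            self._threads.append(t)

    def _work(self):
        while True:
            task = self._queue.get()
            if task is None:  # sentinel: this worker should exit
                break
            task()

    def submit(self, fn):
        """Queue a zero-argument callable for a worker to run."""
        self._queue.put(fn)

    def kill_threads(self):
        """Send one sentinel per worker, then wait for all of them to exit."""
        for _ in self._threads:
            self._queue.put(None)
        for t in self._threads:
            t.join()
        self._threads = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs deterministically when the with-block ends,
        # unlike __del__, which depends on garbage collection.
        self.kill_threads()
        return False  # do not swallow exceptions from the block
```

Used as `with WorkerPool(n_threads=3) as pool: ...`, the threads are guaranteed to be joined when the block ends, so no manual call is needed.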
We tested `get_files` on a dual-core 2014 MacBook Pro (2.4 GHz, SSD) over a decent wireless connection. Each run timed a single `get_files` call that fetched the same small file 50 times.
The version tested was commit 26b3606240ca66d7dbe6def33aab4dba7bb316be
Service | Threads | Time (sec) |
---|---|---|
file | 0 | 0.0036 |
file | 2 | 0.0039 |
file | 4 | 0.0037 |
file | 8 | 0.0053 |
file | 16 | 0.0045 |
file | 32 | 0.0058 |
file | 64 | 0.0070 |
gs | 0 | 27.8455 |
gs | 1 | 10.5758 |
gs | 2 | 4.9513 |
gs | 4 | 2.5868 |
gs | 8 | 1.4941 |
gs | 16 | 0.9418 |
gs | 32 | 0.7500 |
gs | 64 | 0.6997 |
S3 | 0 | 10.0914 |
S3 | 1 | 1.6661 |
S3 | 2 | 0.9482 |
S3 | 4 | 0.6604 |
S3 | 8 | 0.5300 |
S3 | 16 | 0.2337 |
S3 | 32 | 0.2419 |
S3 | 64 | 0.4772 |
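
One way to read the table is as a speedup relative to the unthreaded (0 threads) run. The short script below does that arithmetic on the timings copied from the table; the helper names `speedups` and `best_thread_count` are made up for this example.

```python
# Timings in seconds, copied from the table above: {service: {threads: time}}
timings = {
    "file": {0: 0.0036, 2: 0.0039, 4: 0.0037, 8: 0.0053,
             16: 0.0045, 32: 0.0058, 64: 0.0070},
    "gs": {0: 27.8455, 1: 10.5758, 2: 4.9513, 4: 2.5868,
           8: 1.4941, 16: 0.9418, 32: 0.7500, 64: 0.6997},
    "S3": {0: 10.0914, 1: 1.6661, 2: 0.9482, 4: 0.6604,
           8: 0.5300, 16: 0.2337, 32: 0.2419, 64: 0.4772},
}


def speedups(by_threads):
    """Speedup of each thread count relative to the 0-thread baseline."""
    baseline = by_threads[0]
    return {n: baseline / t for n, t in by_threads.items()}


def best_thread_count(by_threads):
    """Thread count with the lowest measured time."""
    return min(by_threads, key=by_threads.get)


if __name__ == "__main__":
    for service, by_threads in timings.items():
        best = best_thread_count(by_threads)
        print(service, best, round(speedups(by_threads)[best], 1))
```

In these measurements the local `file` backend gains nothing from threads, `gs` keeps improving up to 64 threads, and `S3` is fastest at 16 threads and slows down again at 64.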

The code used to generate these timings is listed below. The command to run the test is:

    py.test -s -v python/test/test_storage.py

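    # Excerpt from python/test/test_storage.py. It relies on `time` and
    # `Storage` being imported at the top of that file, and uses Python 2
    # syntax (xrange, print statement).
    def test_performance():
        def run(url, num_threads):
            s = Storage(url, n_threads=num_threads)
            content = 'some_string'
            # Upload one small uncompressed file and wait for the upload to finish
            s.put_file('info', content, compress=False)
            s.wait_until_queue_empty()
            # Time only the download of the same file 50 times
            start = time.time()
            s.get_files([ 'info' for i in xrange(50) ])
            end = time.time()
            s._kill_threads()
            return end - start

        urls = [
            "file:///tmp/removeme/read_write",
            "gs://neuroglancer/removeme/read_write",
            "s3://neuroglancer/removeme/read_write"
        ]
        for url in urls:
            # 0 threads, then powers of two from 1 to 64
            n_threads = [ 0 ] + [ 2 ** i for i in xrange(0,7) ]
            for num in n_threads:
                delta = run(url, num)
                print url, num, delta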