-
Notifications
You must be signed in to change notification settings - Fork 111
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Optimized FreeList allocator (#15536)
### Ticket [Link to Github Issue](#15535) ### Problem description The current FreeList allocator is fairly slow. Epically under my GGML backend, it becomes unusable slow under high memory pressure. This can be mitigated by using an better optimized allocator. Detailed benchmark can be found my GitHub repo: https://github.com/marty1885/tt-alloc-opt. In short, the new allocator is over an order of magnitude faster (~60x) for large amounts of allocations (DRAM) and at least 2x on smaller allocations (L1). I have tested the new allocator it works against my GGML backend on GS. Due to networking issues, I don't have WH tests ready. I can't until after Thanksgiving. But still want to open the PR and start the discussion. Parameters like table sizes and base values most likely needs to be tuned for real world workloads. Even without changes, it should be faster then the existing one by default. ``` 2024-11-28T11:09:30+08:00 Running ./tt-alloc-opt Run on (16 X 5132 MHz CPU s) CPU Caches: L1 Data 32 KiB (x8) L1 Instruction 32 KiB (x8) L2 Unified 1024 KiB (x8) L3 Unified 16384 KiB (x1) Load Average: 1.02, 0.90, 0.84 ***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead. ------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------------------------- FreeListOpt/WorstCase 156211 ns 155689 ns 4428 FreeListOpt/MixedAllocations 59938 ns 59760 ns 11755 FreeListOpt/TypicalCase 32595 ns 32492 ns 21033 FreeListOpt/Small 3676 ns 3663 ns 179884 FreeListOpt/GetAvailableAddresses 409 ns 407 ns 1726829 FreeListOpt/Statistics 379 ns 378 ns 1848454 FreeListOpt/ShrinkReset 11.7 ns 11.6 ns 60177702 FreeList[BestMatch]/WorstCase 8430132 ns 8396434 ns 81 FreeList[BestMatch]/MixedAllocations 2219312 ns 2211079 ns 314 FreeList[BestMatch]/TypicalCase 863138 ns 860375 ns 802 FreeList[BestMatch]/Small 7968 ns 7944 ns 85656 FreeList[BestMatch]/GetAvailableAddresses 893 ns 890 ns 779186 FreeList[BestMatch]/Statistics 1107 ns 1103 ns 643461 FreeList[BestMatch]/ShrinkReset 3.44 ns 3.43 ns 204268507 FreeList[FirstMatch]/WorstCase 7409143 ns 7380524 ns 95 FreeList[FirstMatch]/MixedAllocations 2283444 ns 2274426 ns 296 FreeList[FirstMatch]/TypicalCase 864990 ns 861908 ns 809 FreeList[FirstMatch]/Small 7726 ns 7705 ns 91195 FreeList[FirstMatch]/GetAvailableAddresses 886 ns 882 ns 794174 FreeList[FirstMatch]/Statistics 1092 ns 1088 ns 645254 FreeList[FirstMatch]/ShrinkReset 3.47 ns 3.45 ns 202517500 ``` The idea of a segregated list comes from the [TLSF](http://www.gii.upv.es/tlsf/files/papers/ecrts04_tlsf.pdf) algorithm. However the intention is not to implement TLSF nor to make a real time allocator. To keep the behavior the same as the exiting FreeList implementation (L1 is small, reducing fragmentation is important) - 1) a full scan of blocks in the size class is done to determine the best place to allocate the new memory. 2) only one layer exists in this implementation's segregation list instead of two. As my synthetic benchmark indicating cache locality is more important then reduced number of blocks scanned. Though, it can be easily converted to a proper 2 layer design by changing how the index is calculated. I have WIP optimizations that should provide an extra 5~10% throughput. But again, I wat to start the discussion early as a new allocator is a big change. For license concerns. TT can have the file licensed under Apache 2.0. I licensed the version under 0BSD in my repo to make the process easy. ### What's changed * Implemented a new `FreeListOpt` allocator * Same algorithm as the existing `FreeList` allocator under BEST mode * Produces the same results * Removes the use of linked list and shared_ptr. Everything is in SoA to maximize cache efficency * Table to store free blocks by size so large allocations won't need to look at smaller blocks * Hash table to convert addresses back into blocks. No more list walk * Same memory coalescing as the existing allocator * Metadata blocks are reused internally to minimize memory allocation * Tests for the new allocator. * `dump_block` new accepts `ostream` instead of `ofstream` ### Checklist - [ ] Post commit CI passes - [ ] Blackhole Post commit (if applicable) - [ ] Model regression CI testing passes (if applicable) - [ ] Device performance regression CI testing passes (if applicable) - [ ] New/Existing tests provide coverage for changes
- Loading branch information
Showing
9 changed files
with
1,108 additions
and
8 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
342 changes: 342 additions & 0 deletions
342
tests/tt_metal/tt_metal/api/allocator/test_free_list_opt_allocator.cpp
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,342 @@ | ||
// SPDX-FileCopyrightText: © 2024 Tenstorrent Inc. | ||
// | ||
// SPDX-License-Identifier: Apache-2.0 | ||
|
||
#include <gtest/gtest.h> | ||
#include "allocator/allocator.hpp" | ||
#include "tt_metal/impl/allocator/algorithms/free_list_opt.hpp" | ||
|
||
// UDL to convert integer literals to SI units | ||
constexpr size_t operator"" _KiB(unsigned long long x) { return x * 1024; } | ||
constexpr size_t operator"" _MiB(unsigned long long x) { return x * 1024 * 1024; } | ||
constexpr size_t operator"" _GiB(unsigned long long x) { return x * 1024 * 1024 * 1024; } | ||
|
||
TEST(FreeListOptTest, Allocation) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto a = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_EQ(a.value(), 0); | ||
|
||
auto b = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_EQ(b.value(), 1_KiB); | ||
} | ||
|
||
TEST(FreeListOptTest, Alignment) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1, 1_KiB); | ||
auto a = allocator.allocate(64); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_EQ(a.value(), 0); | ||
auto b = allocator.allocate(64); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_EQ(b.value(), 1_KiB); | ||
} | ||
|
||
TEST(FreeListOptTest, MinAllocationSize) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1); | ||
auto a = allocator.allocate(1); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_EQ(a.value(), 0); | ||
auto b = allocator.allocate(1); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_EQ(b.value(), 1_KiB); | ||
} | ||
|
||
TEST(FreeListOptTest, Clear) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto a = allocator.allocate(1_KiB); | ||
auto b = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_TRUE(b.has_value()); | ||
allocator.clear(); | ||
auto c = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(c.has_value()); | ||
ASSERT_EQ(c.value(), 0); | ||
} | ||
|
||
TEST(FreeListOptTest, AllocationAndDeallocation) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
std::vector<std::optional<tt::tt_metal::DeviceAddr>> allocations(10); | ||
|
||
// Deallocate in order | ||
for(size_t i = 0; i < allocations.size(); i++) { | ||
allocations[i] = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(allocations[i].has_value()); | ||
} | ||
|
||
for(size_t i = allocations.size(); i > 0; i--) { | ||
allocator.deallocate(allocations[i - 1].value()); | ||
} | ||
|
||
// Deallocate in reverse order | ||
for(size_t i = 0; i < allocations.size(); i++) { | ||
allocations[i] = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(allocations[i].has_value()); | ||
} | ||
|
||
for(size_t i = 0; i < allocations.size(); i++) { | ||
allocator.deallocate(allocations[i].value()); | ||
} | ||
} | ||
|
||
TEST(FreeListOptTest, AllocateAtAddress) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto a = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_EQ(a.value(), 0); | ||
|
||
auto b = allocator.allocate_at_address(1_KiB, 1_KiB); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_EQ(b.value(), 1_KiB); | ||
|
||
// Address is already allocated | ||
auto c = allocator.allocate_at_address(1_KiB, 1_KiB); | ||
ASSERT_FALSE(c.has_value()); | ||
|
||
auto d = allocator.allocate_at_address(2_KiB, 1_KiB); | ||
ASSERT_TRUE(d.has_value()); | ||
ASSERT_EQ(d.value(), 2_KiB); | ||
|
||
allocator.deallocate(a.value()); | ||
auto e = allocator.allocate_at_address(0, 1_KiB); | ||
ASSERT_TRUE(e.has_value()); | ||
ASSERT_EQ(e.value(), 0); | ||
} | ||
|
||
TEST(FreeListOptTest, AllocateAtAddressInteractions) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto wedge = allocator.allocate_at_address(32_KiB, 1_KiB); | ||
|
||
auto a = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_EQ(a.value(), 0); | ||
|
||
auto z = allocator.allocate(1_KiB, false); | ||
ASSERT_TRUE(z.has_value()); | ||
ASSERT_EQ(z.value(), 32_KiB - 1_KiB); // Counterintuitive, but because we use BestFit, it will find the smaller block at the beginning | ||
|
||
auto b = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_EQ(b.value(), 1_KiB); | ||
} | ||
|
||
TEST(FreeListOptTest, ShrinkAndReset) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto a = allocator.allocate(1_KiB); | ||
auto b = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_TRUE(b.has_value()); | ||
allocator.deallocate(a.value()); | ||
|
||
allocator.shrink_size(1_KiB); | ||
auto c = allocator.allocate_at_address(0, 1_KiB); | ||
ASSERT_FALSE(c.has_value()); | ||
|
||
auto d = allocator.allocate_at_address(1_KiB, 1_KiB); | ||
ASSERT_FALSE(d.has_value()); | ||
|
||
allocator.reset_size(); | ||
allocator.deallocate(b.value()); | ||
|
||
auto e = allocator.allocate(2_KiB); | ||
ASSERT_TRUE(e.has_value()); | ||
} | ||
|
||
TEST(FreeListOptTest, Statistics) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto a = allocator.allocate(1_KiB); | ||
auto b = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_TRUE(b.has_value()); | ||
allocator.deallocate(a.value()); | ||
|
||
auto stats = allocator.get_statistics(); | ||
ASSERT_EQ(stats.total_allocated_bytes, 1_KiB); | ||
} | ||
|
||
TEST(FreeListOptTest, AllocateFromTop) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto a = allocator.allocate(1_KiB, false); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_EQ(a.value(), 1_GiB - 1_KiB); | ||
|
||
auto b = allocator.allocate(1_KiB, false); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_EQ(b.value(), 1_GiB - 2_KiB); | ||
|
||
auto c = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(c.has_value()); | ||
ASSERT_EQ(c.value(), 0); | ||
} | ||
|
||
TEST(FreeListOptTest, Coalescing) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto a = allocator.allocate(1_KiB); | ||
auto b = allocator.allocate(1_KiB); | ||
auto c = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_TRUE(c.has_value()); | ||
allocator.deallocate(b.value()); | ||
allocator.deallocate(a.value()); | ||
|
||
auto d = allocator.allocate(2_KiB); | ||
ASSERT_TRUE(d.has_value()); | ||
ASSERT_EQ(d.value(), 0); | ||
} | ||
|
||
TEST(FreeListOptTest, CoalescingAfterResetShrink) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto a = allocator.allocate(1_KiB); | ||
auto b = allocator.allocate(1_KiB); | ||
auto c = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_TRUE(c.has_value()); | ||
allocator.deallocate(b.value()); | ||
allocator.deallocate(a.value()); | ||
|
||
allocator.shrink_size(1_KiB); | ||
auto d = allocator.allocate(2_KiB); | ||
allocator.reset_size(); | ||
auto e = allocator.allocate(2_KiB); | ||
ASSERT_TRUE(e.has_value()); | ||
ASSERT_EQ(e.value(), 0); | ||
} | ||
|
||
TEST(FreeListOptTest, OutOfMemory) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto a = allocator.allocate(1_GiB); | ||
ASSERT_TRUE(a.has_value()); | ||
auto b = allocator.allocate(1_KiB); | ||
ASSERT_FALSE(b.has_value()); | ||
|
||
allocator.clear(); | ||
auto c = allocator.allocate(1_GiB - 1_KiB); | ||
ASSERT_TRUE(c.has_value()); | ||
auto d = allocator.allocate(2_KiB); | ||
ASSERT_FALSE(d.has_value()); | ||
} | ||
|
||
TEST(FreeListOptTest, AvailableAddresses) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto a = allocator.allocate(1_KiB); | ||
auto aval = allocator.available_addresses(1_KiB); | ||
ASSERT_EQ(aval.size(), 1); | ||
ASSERT_EQ(aval[0].first, 1_KiB); // Start address | ||
ASSERT_EQ(aval[0].second, 1_GiB); // End address | ||
allocator.clear(); | ||
|
||
a = allocator.allocate(1_KiB); | ||
auto b = allocator.allocate(1_KiB); | ||
auto c = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_EQ(a.value(), 0); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_EQ(b.value(), 1_KiB); | ||
ASSERT_TRUE(c.has_value()); | ||
ASSERT_EQ(c.value(), 2_KiB); | ||
allocator.deallocate(b.value()); | ||
aval = allocator.available_addresses(1_KiB); | ||
ASSERT_EQ(aval.size(), 2); | ||
ASSERT_EQ(aval[0].first, 1_KiB); // Start address | ||
ASSERT_EQ(aval[0].second, 2_KiB); // End address | ||
ASSERT_EQ(aval[1].first, 3_KiB); // Start address | ||
ASSERT_EQ(aval[1].second, 1_GiB); // End address | ||
|
||
allocator.clear(); | ||
a = allocator.allocate(1_KiB); | ||
b = allocator.allocate(1_KiB); | ||
c = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_EQ(a.value(), 0); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_EQ(b.value(), 1_KiB); | ||
ASSERT_TRUE(c.has_value()); | ||
ASSERT_EQ(c.value(), 2_KiB); | ||
allocator.deallocate(b.value()); | ||
aval = allocator.available_addresses(10_KiB); | ||
ASSERT_EQ(aval.size(), 1); | ||
ASSERT_EQ(aval[0].first, 3_KiB); // Start address | ||
ASSERT_EQ(aval[0].second, 1_GiB); // End address | ||
} | ||
|
||
TEST(FreeListOptTest, LowestOccupiedAddress) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto a = allocator.allocate(1_KiB); | ||
auto b = allocator.allocate(1_KiB); | ||
auto c = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_EQ(a.value(), 0); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_EQ(b.value(), 1_KiB); | ||
ASSERT_TRUE(c.has_value()); | ||
ASSERT_EQ(c.value(), 2_KiB); | ||
auto loa = allocator.lowest_occupied_address(); | ||
ASSERT_EQ(loa.value(), 0); | ||
allocator.deallocate(a.value()); | ||
loa = allocator.lowest_occupied_address(); | ||
ASSERT_EQ(loa.value(), 1_KiB); | ||
allocator.deallocate(b.value()); | ||
loa = allocator.lowest_occupied_address(); | ||
ASSERT_EQ(loa.value(), 2_KiB); | ||
allocator.deallocate(c.value()); | ||
loa = allocator.lowest_occupied_address(); | ||
ASSERT_FALSE(loa.has_value()); | ||
} | ||
|
||
TEST(FreeListOptTest, LowestOccupiedAddressWithAllocateAt) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB); | ||
auto a = allocator.allocate_at_address(1_KiB, 1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_EQ(a.value(), 1_KiB); | ||
auto loa = allocator.lowest_occupied_address(); | ||
ASSERT_EQ(loa.value(), 1_KiB); | ||
allocator.deallocate(a.value()); | ||
loa = allocator.lowest_occupied_address(); | ||
ASSERT_FALSE(loa.has_value()); | ||
} | ||
|
||
TEST(FreeListOptTest, FirstFit) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB, tt::tt_metal::allocator::FreeListOpt::SearchPolicy::FIRST); | ||
auto a = allocator.allocate(1_KiB); | ||
auto b = allocator.allocate(3_KiB); | ||
auto c = allocator.allocate(1_KiB); | ||
auto d = allocator.allocate(1_KiB); | ||
auto e = allocator.allocate(1_KiB); | ||
|
||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_EQ(a.value(), 0); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_EQ(b.value(), 1_KiB); | ||
ASSERT_TRUE(c.has_value()); | ||
ASSERT_EQ(c.value(), 4_KiB); | ||
ASSERT_TRUE(d.has_value()); | ||
ASSERT_EQ(d.value(), 5_KiB); | ||
ASSERT_TRUE(e.has_value()); | ||
ASSERT_EQ(e.value(), 6_KiB); | ||
|
||
allocator.deallocate(b.value()); | ||
allocator.deallocate(d.value()); | ||
|
||
auto f = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(f.has_value()); | ||
ASSERT_EQ(f.value(), 1_KiB); | ||
} | ||
|
||
TEST(FreeListOptTest, FirstFitAllocateAtAddressInteractions) { | ||
auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB, tt::tt_metal::allocator::FreeListOpt::SearchPolicy::FIRST); | ||
auto wedge = allocator.allocate_at_address(32_KiB, 1_KiB); | ||
|
||
auto a = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(a.has_value()); | ||
ASSERT_EQ(a.value(), 0); | ||
|
||
auto z = allocator.allocate(1_KiB, false); | ||
ASSERT_TRUE(z.has_value()); | ||
ASSERT_EQ(z.value(), 1_GiB - 1_KiB); | ||
|
||
auto b = allocator.allocate(1_KiB); | ||
ASSERT_TRUE(b.has_value()); | ||
ASSERT_EQ(b.value(), 1_KiB); | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.