Optimized FreeList allocator (#15536)
### Ticket
[Link to GitHub Issue](#15535)

### Problem description
The current FreeList allocator is fairly slow. Especially under my GGML
backend, it becomes unusably slow under high memory pressure. This can
be mitigated by using a better-optimized allocator.

A detailed benchmark can be found in my GitHub repo:
https://github.com/marty1885/tt-alloc-opt. In short, the new allocator
is over an order of magnitude faster (~60x) for large numbers of
allocations (DRAM) and at least 2x faster for smaller allocations (L1).
I have tested the new allocator against my GGML backend on GS and it works.

Due to networking issues, I don't have WH tests ready, and I can't run
them until after Thanksgiving. I still want to open the PR now and start
the discussion.

Parameters like table sizes and base values most likely need to be
tuned for real-world workloads. Even without such tuning, it should be
faster than the existing allocator by default.

```
2024-11-28T11:09:30+08:00
Running ./tt-alloc-opt
Run on (16 X 5132 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.02, 0.90, 0.84
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
FreeListOpt/WorstCase                          156211 ns       155689 ns         4428
FreeListOpt/MixedAllocations                    59938 ns        59760 ns        11755
FreeListOpt/TypicalCase                         32595 ns        32492 ns        21033
FreeListOpt/Small                                3676 ns         3663 ns       179884
FreeListOpt/GetAvailableAddresses                 409 ns          407 ns      1726829
FreeListOpt/Statistics                            379 ns          378 ns      1848454
FreeListOpt/ShrinkReset                          11.7 ns         11.6 ns     60177702
FreeList[BestMatch]/WorstCase                 8430132 ns      8396434 ns           81
FreeList[BestMatch]/MixedAllocations          2219312 ns      2211079 ns          314
FreeList[BestMatch]/TypicalCase                863138 ns       860375 ns          802
FreeList[BestMatch]/Small                        7968 ns         7944 ns        85656
FreeList[BestMatch]/GetAvailableAddresses         893 ns          890 ns       779186
FreeList[BestMatch]/Statistics                   1107 ns         1103 ns       643461
FreeList[BestMatch]/ShrinkReset                  3.44 ns         3.43 ns    204268507
FreeList[FirstMatch]/WorstCase                7409143 ns      7380524 ns           95
FreeList[FirstMatch]/MixedAllocations         2283444 ns      2274426 ns          296
FreeList[FirstMatch]/TypicalCase               864990 ns       861908 ns          809
FreeList[FirstMatch]/Small                       7726 ns         7705 ns        91195
FreeList[FirstMatch]/GetAvailableAddresses        886 ns          882 ns       794174
FreeList[FirstMatch]/Statistics                  1092 ns         1088 ns       645254
FreeList[FirstMatch]/ShrinkReset                 3.47 ns         3.45 ns    202517500
```

The idea of a segregated list comes from the
[TLSF](http://www.gii.upv.es/tlsf/files/papers/ecrts04_tlsf.pdf)
algorithm. However, the intention is not to implement TLSF nor to build a
real-time allocator. To keep the behavior the same as the existing
FreeList implementation (L1 is small, so reducing fragmentation is
important): 1) a full scan of the blocks in a size class is done to
determine the best place for the new allocation, and 2) only one layer
exists in this implementation's segregated list instead of two, since my
synthetic benchmark indicates that cache locality matters more than
reducing the number of blocks scanned. It can, however, easily be
converted to a proper two-layer design by changing how the index is
calculated.
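
To make that concrete, here is a minimal sketch of a one-layer segregated list with a best-fit scan inside each bucket. It is an illustration only, assuming power-of-two size classes and a fixed table size; the names (`kNumSizeClasses`, `FreeBlock`, `size_class_of`, `best_fit`) are hypothetical and do not come from `free_list_opt.cpp`.

```cpp
// Hypothetical sketch of a one-layer segregated free list. None of these
// names come from free_list_opt.cpp.
#include <array>
#include <bit>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr size_t kNumSizeClasses = 32;  // assumed table size; would need tuning

struct FreeBlock {
    uint64_t address;
    uint64_t size;
};

// Map an allocation size (> 0) to a bucket index: floor(log2(size)), clamped.
inline size_t size_class_of(uint64_t size) {
    const size_t log2_size = static_cast<size_t>(std::bit_width(size)) - 1;
    return log2_size < kNumSizeClasses ? log2_size : kNumSizeClasses - 1;
}

// Best-fit scan: only buckets that can hold `size` are visited, and within a
// bucket every candidate is scanned so the smallest fitting block wins.
inline std::optional<uint64_t> best_fit(
    const std::array<std::vector<FreeBlock>, kNumSizeClasses>& buckets, uint64_t size) {
    for (size_t cls = size_class_of(size); cls < kNumSizeClasses; cls++) {
        const FreeBlock* best = nullptr;
        for (const FreeBlock& block : buckets[cls]) {
            if (block.size >= size && (best == nullptr || block.size < best->size)) {
                best = &block;
            }
        }
        if (best != nullptr) {
            return best->address;
        }
    }
    return std::nullopt;
}
```

A two-layer design would only change `size_class_of` to also fold in a few bits below the leading bit, which is why the conversion mentioned above is straightforward.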

I have WIP optimizations that should provide an extra 5~10% throughput,
but again, I want to start the discussion early since a new allocator is
a big change.

Regarding license concerns: TT can have the file licensed under Apache 2.0.
I licensed the version in my repo under 0BSD to make the process easy.

### What's changed

* Implemented a new `FreeListOpt` allocator
  * Same algorithm as the existing `FreeList` allocator under BEST mode
    * Produces the same results
* Removed the use of the linked list and `shared_ptr`. Everything is stored in SoA form to maximize cache efficiency (see the sketch after this list)
* Table to store free blocks by size so large allocations won't need to
look at smaller blocks
  * Hash table to convert addresses back into blocks. No more list walk
  * Same memory coalescing as the existing allocator
  * Metadata blocks are reused internally to minimize memory allocation
* Tests for the new allocator.
* `dump_block` now accepts `std::ostream` instead of `std::ofstream`
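
For illustration, the SoA layout plus the address lookup described above could be shaped roughly like the sketch below. The field names are hypothetical and only show the intent; they are not the actual members of `FreeListOpt`.

```cpp
// Hypothetical shape of an SoA block table with an address-to-block hash map.
// Field names are illustrative; they are not the actual members of FreeListOpt.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct BlockTableSketch {
    // One entry per metadata block, stored in parallel vectors so scans walk
    // contiguous memory instead of chasing shared_ptr-linked nodes.
    std::vector<uint64_t> address;
    std::vector<uint64_t> size;
    std::vector<uint8_t> allocated;   // 0 = free, 1 = allocated (uint8_t avoids vector<bool>)
    std::vector<size_t> prev_block;   // physical neighbors, used for coalescing
    std::vector<size_t> next_block;

    // Deallocation looks a block up by address directly instead of walking a list.
    std::unordered_map<uint64_t, size_t> block_of_address;

    // Indices of retired metadata entries, recycled so repeated allocate/deallocate
    // cycles do not keep growing the vectors.
    std::vector<size_t> free_slots;
};
```

Because every field is a flat vector, a scan over the blocks in a size class touches contiguous memory, which matches the cache-locality observation from the synthetic benchmark above.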

### Checklist
- [ ] Post commit CI passes
- [ ] Blackhole Post commit (if applicable)
- [ ] Model regression CI testing passes (if applicable)
- [ ] Device performance regression CI testing passes (if applicable)
- [ ] New/Existing tests provide coverage for changes
marty1885 authored Dec 19, 2024
1 parent fa779b9 commit e882125
Showing 9 changed files with 1,108 additions and 8 deletions.
1 change: 1 addition & 0 deletions tests/tt_metal/tt_metal/api/CMakeLists.txt
@@ -1,5 +1,6 @@
set(UNIT_TESTS_API_SRC
    ${CMAKE_CURRENT_SOURCE_DIR}/allocator/test_free_list_allocator.cpp
+   ${CMAKE_CURRENT_SOURCE_DIR}/allocator/test_free_list_opt_allocator.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/allocator/test_l1_banking_allocator.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/circular_buffer/test_CircularBuffer_allocation.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/circular_buffer/test_CircularBuffer_creation.cpp
342 changes: 342 additions & 0 deletions tests/tt_metal/tt_metal/api/allocator/test_free_list_opt_allocator.cpp
@@ -0,0 +1,342 @@
// SPDX-FileCopyrightText: © 2024 Tenstorrent Inc.
//
// SPDX-License-Identifier: Apache-2.0

#include <gtest/gtest.h>
#include "allocator/allocator.hpp"
#include "tt_metal/impl/allocator/algorithms/free_list_opt.hpp"

// User-defined literals to convert integer literals to binary (IEC) size units
constexpr size_t operator"" _KiB(unsigned long long x) { return x * 1024; }
constexpr size_t operator"" _MiB(unsigned long long x) { return x * 1024 * 1024; }
constexpr size_t operator"" _GiB(unsigned long long x) { return x * 1024 * 1024 * 1024; }

TEST(FreeListOptTest, Allocation) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto a = allocator.allocate(1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_EQ(a.value(), 0);

    auto b = allocator.allocate(1_KiB);
    ASSERT_TRUE(b.has_value());
    ASSERT_EQ(b.value(), 1_KiB);
}

TEST(FreeListOptTest, Alignment) {
    // Only 64 bytes are requested, but the 1_KiB alignment keeps allocations 1_KiB apart
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1, 1_KiB);
    auto a = allocator.allocate(64);
    ASSERT_TRUE(a.has_value());
    ASSERT_EQ(a.value(), 0);
    auto b = allocator.allocate(64);
    ASSERT_TRUE(b.has_value());
    ASSERT_EQ(b.value(), 1_KiB);
}

TEST(FreeListOptTest, MinAllocationSize) {
    // A 1-byte request still consumes the 1_KiB minimum allocation size
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1);
    auto a = allocator.allocate(1);
    ASSERT_TRUE(a.has_value());
    ASSERT_EQ(a.value(), 0);
    auto b = allocator.allocate(1);
    ASSERT_TRUE(b.has_value());
    ASSERT_EQ(b.value(), 1_KiB);
}

TEST(FreeListOptTest, Clear) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto a = allocator.allocate(1_KiB);
    auto b = allocator.allocate(1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_TRUE(b.has_value());
    allocator.clear();
    auto c = allocator.allocate(1_KiB);
    ASSERT_TRUE(c.has_value());
    ASSERT_EQ(c.value(), 0);
}

TEST(FreeListOptTest, AllocationAndDeallocation) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    std::vector<std::optional<tt::tt_metal::DeviceAddr>> allocations(10);

    // Allocate, then deallocate in reverse order
    for(size_t i = 0; i < allocations.size(); i++) {
        allocations[i] = allocator.allocate(1_KiB);
        ASSERT_TRUE(allocations[i].has_value());
    }

    for(size_t i = allocations.size(); i > 0; i--) {
        allocator.deallocate(allocations[i - 1].value());
    }

    // Allocate, then deallocate in order
    for(size_t i = 0; i < allocations.size(); i++) {
        allocations[i] = allocator.allocate(1_KiB);
        ASSERT_TRUE(allocations[i].has_value());
    }

    for(size_t i = 0; i < allocations.size(); i++) {
        allocator.deallocate(allocations[i].value());
    }
}

TEST(FreeListOptTest, AllocateAtAddress) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto a = allocator.allocate(1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_EQ(a.value(), 0);

    auto b = allocator.allocate_at_address(1_KiB, 1_KiB);
    ASSERT_TRUE(b.has_value());
    ASSERT_EQ(b.value(), 1_KiB);

    // Address is already allocated
    auto c = allocator.allocate_at_address(1_KiB, 1_KiB);
    ASSERT_FALSE(c.has_value());

    auto d = allocator.allocate_at_address(2_KiB, 1_KiB);
    ASSERT_TRUE(d.has_value());
    ASSERT_EQ(d.value(), 2_KiB);

    allocator.deallocate(a.value());
    auto e = allocator.allocate_at_address(0, 1_KiB);
    ASSERT_TRUE(e.has_value());
    ASSERT_EQ(e.value(), 0);
}

TEST(FreeListOptTest, AllocateAtAddressInteractions) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto wedge = allocator.allocate_at_address(32_KiB, 1_KiB);

    auto a = allocator.allocate(1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_EQ(a.value(), 0);

    auto z = allocator.allocate(1_KiB, false);
    ASSERT_TRUE(z.has_value());
    ASSERT_EQ(z.value(), 32_KiB - 1_KiB); // Counterintuitive, but because we use BestFit, it will find the smaller block at the beginning

    auto b = allocator.allocate(1_KiB);
    ASSERT_TRUE(b.has_value());
    ASSERT_EQ(b.value(), 1_KiB);
}

TEST(FreeListOptTest, ShrinkAndReset) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto a = allocator.allocate(1_KiB);
    auto b = allocator.allocate(1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_TRUE(b.has_value());
    allocator.deallocate(a.value());

    allocator.shrink_size(1_KiB);
    auto c = allocator.allocate_at_address(0, 1_KiB);
    ASSERT_FALSE(c.has_value());

    auto d = allocator.allocate_at_address(1_KiB, 1_KiB);
    ASSERT_FALSE(d.has_value());

    allocator.reset_size();
    allocator.deallocate(b.value());

    auto e = allocator.allocate(2_KiB);
    ASSERT_TRUE(e.has_value());
}

TEST(FreeListOptTest, Statistics) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto a = allocator.allocate(1_KiB);
    auto b = allocator.allocate(1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_TRUE(b.has_value());
    allocator.deallocate(a.value());

    auto stats = allocator.get_statistics();
    ASSERT_EQ(stats.total_allocated_bytes, 1_KiB);
}

TEST(FreeListOptTest, AllocateFromTop) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto a = allocator.allocate(1_KiB, false);
    ASSERT_TRUE(a.has_value());
    ASSERT_EQ(a.value(), 1_GiB - 1_KiB);

    auto b = allocator.allocate(1_KiB, false);
    ASSERT_TRUE(b.has_value());
    ASSERT_EQ(b.value(), 1_GiB - 2_KiB);

    auto c = allocator.allocate(1_KiB);
    ASSERT_TRUE(c.has_value());
    ASSERT_EQ(c.value(), 0);
}

TEST(FreeListOptTest, Coalescing) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto a = allocator.allocate(1_KiB);
    auto b = allocator.allocate(1_KiB);
    auto c = allocator.allocate(1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_TRUE(b.has_value());
    ASSERT_TRUE(c.has_value());
    allocator.deallocate(b.value());
    allocator.deallocate(a.value());

    auto d = allocator.allocate(2_KiB);
    ASSERT_TRUE(d.has_value());
    ASSERT_EQ(d.value(), 0);
}

TEST(FreeListOptTest, CoalescingAfterResetShrink) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto a = allocator.allocate(1_KiB);
    auto b = allocator.allocate(1_KiB);
    auto c = allocator.allocate(1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_TRUE(b.has_value());
    ASSERT_TRUE(c.has_value());
    allocator.deallocate(b.value());
    allocator.deallocate(a.value());

    allocator.shrink_size(1_KiB);
    auto d = allocator.allocate(2_KiB);
    allocator.reset_size();
    auto e = allocator.allocate(2_KiB);
    ASSERT_TRUE(e.has_value());
    ASSERT_EQ(e.value(), 0);
}

TEST(FreeListOptTest, OutOfMemory) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto a = allocator.allocate(1_GiB);
    ASSERT_TRUE(a.has_value());
    auto b = allocator.allocate(1_KiB);
    ASSERT_FALSE(b.has_value());

    allocator.clear();
    auto c = allocator.allocate(1_GiB - 1_KiB);
    ASSERT_TRUE(c.has_value());
    auto d = allocator.allocate(2_KiB);
    ASSERT_FALSE(d.has_value());
}

TEST(FreeListOptTest, AvailableAddresses) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto a = allocator.allocate(1_KiB);
    auto aval = allocator.available_addresses(1_KiB);
    ASSERT_EQ(aval.size(), 1);
    ASSERT_EQ(aval[0].first, 1_KiB); // Start address
    ASSERT_EQ(aval[0].second, 1_GiB); // End address
    allocator.clear();

    a = allocator.allocate(1_KiB);
    auto b = allocator.allocate(1_KiB);
    auto c = allocator.allocate(1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_EQ(a.value(), 0);
    ASSERT_TRUE(b.has_value());
    ASSERT_EQ(b.value(), 1_KiB);
    ASSERT_TRUE(c.has_value());
    ASSERT_EQ(c.value(), 2_KiB);
    allocator.deallocate(b.value());
    aval = allocator.available_addresses(1_KiB);
    ASSERT_EQ(aval.size(), 2);
    ASSERT_EQ(aval[0].first, 1_KiB); // Start address
    ASSERT_EQ(aval[0].second, 2_KiB); // End address
    ASSERT_EQ(aval[1].first, 3_KiB); // Start address
    ASSERT_EQ(aval[1].second, 1_GiB); // End address

    allocator.clear();
    a = allocator.allocate(1_KiB);
    b = allocator.allocate(1_KiB);
    c = allocator.allocate(1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_EQ(a.value(), 0);
    ASSERT_TRUE(b.has_value());
    ASSERT_EQ(b.value(), 1_KiB);
    ASSERT_TRUE(c.has_value());
    ASSERT_EQ(c.value(), 2_KiB);
    allocator.deallocate(b.value());
    aval = allocator.available_addresses(10_KiB);
    ASSERT_EQ(aval.size(), 1);
    ASSERT_EQ(aval[0].first, 3_KiB); // Start address
    ASSERT_EQ(aval[0].second, 1_GiB); // End address
}

TEST(FreeListOptTest, LowestOccupiedAddress) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto a = allocator.allocate(1_KiB);
    auto b = allocator.allocate(1_KiB);
    auto c = allocator.allocate(1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_EQ(a.value(), 0);
    ASSERT_TRUE(b.has_value());
    ASSERT_EQ(b.value(), 1_KiB);
    ASSERT_TRUE(c.has_value());
    ASSERT_EQ(c.value(), 2_KiB);
    auto loa = allocator.lowest_occupied_address();
    ASSERT_EQ(loa.value(), 0);
    allocator.deallocate(a.value());
    loa = allocator.lowest_occupied_address();
    ASSERT_EQ(loa.value(), 1_KiB);
    allocator.deallocate(b.value());
    loa = allocator.lowest_occupied_address();
    ASSERT_EQ(loa.value(), 2_KiB);
    allocator.deallocate(c.value());
    loa = allocator.lowest_occupied_address();
    ASSERT_FALSE(loa.has_value());
}

TEST(FreeListOptTest, LowestOccupiedAddressWithAllocateAt) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB);
    auto a = allocator.allocate_at_address(1_KiB, 1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_EQ(a.value(), 1_KiB);
    auto loa = allocator.lowest_occupied_address();
    ASSERT_EQ(loa.value(), 1_KiB);
    allocator.deallocate(a.value());
    loa = allocator.lowest_occupied_address();
    ASSERT_FALSE(loa.has_value());
}

TEST(FreeListOptTest, FirstFit) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB, tt::tt_metal::allocator::FreeListOpt::SearchPolicy::FIRST);
    auto a = allocator.allocate(1_KiB);
    auto b = allocator.allocate(3_KiB);
    auto c = allocator.allocate(1_KiB);
    auto d = allocator.allocate(1_KiB);
    auto e = allocator.allocate(1_KiB);

    ASSERT_TRUE(a.has_value());
    ASSERT_EQ(a.value(), 0);
    ASSERT_TRUE(b.has_value());
    ASSERT_EQ(b.value(), 1_KiB);
    ASSERT_TRUE(c.has_value());
    ASSERT_EQ(c.value(), 4_KiB);
    ASSERT_TRUE(d.has_value());
    ASSERT_EQ(d.value(), 5_KiB);
    ASSERT_TRUE(e.has_value());
    ASSERT_EQ(e.value(), 6_KiB);

    allocator.deallocate(b.value());
    allocator.deallocate(d.value());

    auto f = allocator.allocate(1_KiB);
    ASSERT_TRUE(f.has_value());
    ASSERT_EQ(f.value(), 1_KiB);
}

TEST(FreeListOptTest, FirstFitAllocateAtAddressInteractions) {
    auto allocator = tt::tt_metal::allocator::FreeListOpt(1_GiB, 0, 1_KiB, 1_KiB, tt::tt_metal::allocator::FreeListOpt::SearchPolicy::FIRST);
    auto wedge = allocator.allocate_at_address(32_KiB, 1_KiB);

    auto a = allocator.allocate(1_KiB);
    ASSERT_TRUE(a.has_value());
    ASSERT_EQ(a.value(), 0);

    auto z = allocator.allocate(1_KiB, false);
    ASSERT_TRUE(z.has_value());
    ASSERT_EQ(z.value(), 1_GiB - 1_KiB);

    auto b = allocator.allocate(1_KiB);
    ASSERT_TRUE(b.has_value());
    ASSERT_EQ(b.value(), 1_KiB);
}
1 change: 1 addition & 0 deletions tt_metal/impl/CMakeLists.txt
@@ -12,6 +12,7 @@ set(IMPL_SRC
    ${CMAKE_CURRENT_SOURCE_DIR}/buffers/semaphore.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/kernels/kernel.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/allocator/algorithms/free_list.cpp
+   ${CMAKE_CURRENT_SOURCE_DIR}/allocator/algorithms/free_list_opt.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/allocator/allocator.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/allocator/basic_allocator.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/allocator/l1_banking_allocator.cpp
2 changes: 1 addition & 1 deletion tt_metal/impl/allocator/algorithms/allocator_algorithm.hpp
@@ -63,7 +63,7 @@ class Algorithm {

    virtual Statistics get_statistics() const = 0;

-   virtual void dump_blocks(std::ofstream& out) const = 0;
+   virtual void dump_blocks(std::ostream& out) const = 0;

    virtual void shrink_size(DeviceAddr shrink_size, bool bottom_up = true) = 0;

4 changes: 2 additions & 2 deletions tt_metal/impl/allocator/algorithms/free_list.cpp
@@ -431,12 +431,12 @@ Statistics FreeList::get_statistics() const {
    return stats;
}

-void FreeList::dump_block(const boost::local_shared_ptr<Block>& block, std::ofstream& out) const {
+void FreeList::dump_block(const boost::local_shared_ptr<Block>& block, std::ostream& out) const {
    auto alloc_status = this->is_allocated(block) ? "Y" : "N";
    out << ",,," << (block->address + this->offset_bytes_) << "," << (block->size) << "," << alloc_status << "\n";
}

-void FreeList::dump_blocks(std::ofstream& out) const {
+void FreeList::dump_blocks(std::ostream& out) const {
    out << ",,Blocks:,Address (B),Size (B),Allocated (Y/N)\n";
    boost::local_shared_ptr<Block> curr_block = this->block_head_;
    while (curr_block != nullptr) {
4 changes: 2 additions & 2 deletions tt_metal/impl/allocator/algorithms/free_list.hpp
@@ -36,7 +36,7 @@ class FreeList : public Algorithm {

    Statistics get_statistics() const;

-   void dump_blocks(std::ofstream& out) const;
+   void dump_blocks(std::ostream& out) const;

    void shrink_size(DeviceAddr shrink_size, bool bottom_up = true);

@@ -66,7 +66,7 @@
        boost::local_shared_ptr<Block> next_free = nullptr;
    };

-   void dump_block(const boost::local_shared_ptr<Block>& block, std::ofstream& out) const;
+   void dump_block(const boost::local_shared_ptr<Block>& block, std::ostream& out) const;

    bool is_allocated(const boost::local_shared_ptr<Block>& block) const;

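Since `dump_blocks` now takes a `std::ostream`, block dumps are no longer tied to a file on disk. A small illustrative usage sketch, assuming `dump_blocks` is reachable through the public allocator interface as in the tests above:

```cpp
#include <iostream>
#include <sstream>
#include "allocator/allocator.hpp"
#include "tt_metal/impl/allocator/algorithms/free_list_opt.hpp"

// Illustrative only: dump block state to stdout, or capture it in a string,
// which the previous std::ofstream-only signature could not do.
void log_blocks(const tt::tt_metal::allocator::FreeListOpt& allocator) {
    allocator.dump_blocks(std::cout);  // straight to the console

    std::ostringstream oss;
    allocator.dump_blocks(oss);        // or into a string for a logger or a test assertion
}
```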