
1 Implementation

A Winner Loser Tree is a specialized data structure that is typically used in sorting and merging algorithms, particularly in scenarios like external sorting or multiway merges. The concept is derived from tournament trees, which are a form of binary tree.

  • A Winner Loser Tree is a complete binary tree.
  • It has internal nodes and leaf nodes. The leaf nodes usually represent the elements being merged or sorted (like the heads of sorted arrays in multiway merging).
  • The tree is essentially a tournament in which each internal node represents a match between two elements (its children). In each match, one element is the “winner” (the smaller or the larger one, depending on whether the goal is ascending or descending order) and the other is the “loser”; the internal node always keeps the “loser”, while the “winner” moves up to the next match.
  • The overall “winner” has to be recorded separately, outside of the tree. A small worked example and a complete C++ implementation follow below.
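As a small illustration, take the four leaves 3, 1, 4, 2 with the smaller element winning each match. The first-round losers (3 and 4) are stored in the lower internal nodes, the loser of the final match (2) is stored in the root, and the overall winner (1) is recorded outside the tree:

overall winner: 1   (recorded outside the tree)

       [2]          root: loser of the final match, 1 vs 2
     /     \
  [3]       [4]     internal nodes: losers of the first-round matches
 /   \     /   \
3     1   4     2   leaves: the elements being sorted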
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <limits>
#include <random>
#include <stdexcept>
#include <vector>

class WinnerLoserTree {
private:
    struct Cmp {
        bool operator()(const int32_t& num1, const int32_t& num2) const { return num1 < num2; }
        // Sentinel used to mark a leaf whose element has already been popped.
        static constexpr int32_t INVALID = std::numeric_limits<int32_t>::max();
    };

private:
    const Cmp _cmp;
    const size_t _size;
    // Stores the actual leaf nodes.
    std::vector<int32_t> _leaves;
    // Stores the internal nodes of the tree as indexes into the leaf array.
    // Each internal node stores the loser of the match between its children.
    std::vector<size_t> _loser_tree;
    // Stores the final winner, also as an index into the leaf array.
    size_t _winner;

    void _adjust(size_t leaf) {
        // For a complete binary tree whose number of leaves is a power of two, there are
        // `_size - 1` internal nodes. Since the leaf nodes and internal nodes are stored
        // separately, the index of a leaf inside the whole tree is `leaf + (_size - 1)`,
        // so its parent is `(leaf + (_size - 1) - 1) / 2`.
        //
        // For a non-complete binary tree the same formula still works, though the proof is
        // omitted here.
        size_t parent = (leaf + (_size - 1) - 1) / 2;
        while (true) {
            // If the current candidate loses against the loser stored in the parent (or the
            // parent is still empty), it stays there and the previous occupant moves up.
            if (leaf < _size && (_loser_tree[parent] == _size || !_cmp(_leaves[leaf], _leaves[_loser_tree[parent]]))) {
                std::swap(leaf, _loser_tree[parent]);
            }
            if (parent == 0) {
                break;
            }
            parent = (parent - 1) / 2;
        }
        // Whatever survives every match along the path is the new overall winner.
        _winner = leaf;
    }

public:
    WinnerLoserTree(const std::vector<int32_t>& nums)
        : _cmp({}), _size(nums.size()), _leaves(nums), _loser_tree(_size, _size), _winner(_size) {}

    void build() {
        for (int32_t i = static_cast<int32_t>(_size) - 1; i >= 0; i--) {
            _adjust(i);
        }
    }

    int32_t pop() {
        // Take the current winner, replace it with the sentinel, and replay its path.
        int32_t winner_num = _leaves[_winner];
        _leaves[_winner] = _cmp.INVALID;
        _adjust(_winner);
        return winner_num;
    }
};

std::vector<int32_t> merge_k_sorted_arrays(const std::vector<std::vector<int32_t>>& arrays) {
    std::vector<int32_t> merged;
    std::vector<int32_t> all_elements;

    // Every element is placed in one tree, so the inputs do not even need to be pre-sorted.
    for (const auto& arr : arrays) {
        all_elements.insert(all_elements.end(), arr.begin(), arr.end());
    }

    WinnerLoserTree lt(all_elements);
    lt.build();

    for (size_t i = 0; i < all_elements.size(); i++) {
        merged.push_back(lt.pop());
    }

    return merged;
}

int main() {
    std::default_random_engine e;
    std::uniform_int_distribution<size_t> u_size(1, 16);
    std::uniform_int_distribution<int32_t> u_num(0, 100);

    for (size_t test = 0; test < 100; ++test) {
        std::vector<std::vector<int32_t>> arrays;
        std::vector<int32_t> expected;
        const size_t input_num = u_size(e);

        // Generate `input_num` random arrays and keep a flat copy for checking the result.
        for (size_t i = 0; i < input_num; ++i) {
            const size_t input_size = u_size(e);
            std::vector<int32_t> input;
            for (size_t j = 0; j < input_size; j++) {
                input.push_back(u_num(e));
            }
            expected.insert(expected.end(), input.begin(), input.end());
            arrays.push_back(std::move(input));
        }

        std::vector<int32_t> result = merge_k_sorted_arrays(arrays);
        std::sort(expected.begin(), expected.end());

        for (size_t i = 0; i < result.size(); ++i) {
            if (result[i] != expected[i]) {
                throw std::logic_error("assertion failed");
            }
        }
        std::cout << "test(" << test << "): size=" << expected.size() << std::endl;
    }

    return 0;
}

1 Introduction

LSM-tree stands for Log-Structured Merge Tree. It is a data structure used in computer systems, particularly in storage and database systems, to efficiently manage and store large volumes of data with high write throughput.

The LSM-tree structure is designed to optimize write operations, which are typically more expensive in traditional storage structures. It achieves this by leveraging the characteristics of sequential disk I/O and reducing the number of random disk accesses.

The basic idea behind LSM-trees is to separate write and read operations by maintaining multiple levels of data structures. When data is written, it is first inserted into an in-memory component called a memtable, which serves as a write buffer. The memtable is typically implemented as a skip list or a hash table for efficient insertion and retrieval.
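As a rough sketch of that write path (not the code of any particular system; the class name MemTable, its put method, and the size threshold are made up for illustration), an ordered map can stand in for the memtable:

#include <cstddef>
#include <map>
#include <string>

// A minimal memtable sketch: an ordered in-memory buffer for recent writes.
// Real systems usually use a skip list or hash table; std::map is used here only for brevity.
class MemTable {
public:
    void put(const std::string& key, const std::string& value) {
        _bytes += key.size() + value.size();
        _entries[key] = value;  // a later write to the same key overwrites the earlier one
    }

    // When the buffer grows past some threshold, it is flushed to disk as a sorted run.
    bool should_flush(size_t limit_bytes) const { return _bytes >= limit_bytes; }

    const std::map<std::string, std::string>& entries() const { return _entries; }

private:
    std::map<std::string, std::string> _entries;  // keys kept in sorted order
    size_t _bytes = 0;                            // rough payload size of the buffer
};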

Periodically, the contents of the memtable are flushed to disk as sorted run files, also known as SSTables (Sorted String Tables). SSTables are immutable and ordered by key, which allows efficient sequential reads. Each SSTable typically covers a specific time window or is bounded by a size limit.
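A flush can then be sketched as a sequential dump of that sorted buffer. The SSTable alias and the flush_memtable function below are hypothetical; a real implementation would write the run to a file, but an in-memory vector is enough to show the idea:

#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for an SSTable: an immutable, key-sorted run.
using SSTable = std::vector<std::pair<std::string, std::string>>;

// The memtable is already sorted, so flushing it is a single sequential pass,
// which is exactly the kind of I/O pattern the LSM-tree is designed around.
SSTable flush_memtable(const std::map<std::string, std::string>& memtable) {
    return SSTable(memtable.begin(), memtable.end());
}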

To handle read operations efficiently, LSM-trees use a combination of the in-memory memtable and the on-disk SSTables. When a read request arrives, the LSM-tree first checks the memtable for the requested key. If the key is not found there, it searches the SSTables from the most recent to the oldest, so the newest version of a key is returned first. Because a read may have to consult several components, this lookup is sometimes described as a read-path merge.
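A read-path sketch under the same assumptions as above (memtable as an ordered map, SSTables as sorted in-memory runs, newest run last) might look like this:

#include <algorithm>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using SSTable = std::vector<std::pair<std::string, std::string>>;  // key-sorted run

// Look up a key: the memtable first, then the runs from newest to oldest,
// so the most recent version of the key wins.
std::optional<std::string> get(const std::map<std::string, std::string>& memtable,
                               const std::vector<SSTable>& runs,  // runs.front() is the oldest
                               const std::string& key) {
    if (auto it = memtable.find(key); it != memtable.end()) {
        return it->second;
    }
    for (auto run = runs.rbegin(); run != runs.rend(); ++run) {
        // Each run is sorted by key, so a binary search is enough.
        auto it = std::lower_bound(run->begin(), run->end(), key,
                                   [](const auto& entry, const std::string& k) { return entry.first < k; });
        if (it != run->end() && it->first == key) {
            return it->second;
        }
    }
    return std::nullopt;  // the key is not present in any component
}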

To keep the number of on-disk components manageable and to discard stale or deleted entries, LSM-trees employ background compaction processes. Compaction merges overlapping or redundant data from several SSTables into a new, compacted SSTable. This reduces the number of SSTables a read has to consult and helps to optimize read performance.
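Compaction itself is essentially a merge of sorted runs. The compact function below is a simplified two-way sketch (real systems merge many runs at once and also handle deletion markers); for duplicate keys the entry from the newer run shadows the one from the older run:

#include <string>
#include <utility>
#include <vector>

using SSTable = std::vector<std::pair<std::string, std::string>>;  // key-sorted run

SSTable compact(const SSTable& older, const SSTable& newer) {
    SSTable out;
    size_t i = 0, j = 0;
    while (i < older.size() && j < newer.size()) {
        if (older[i].first < newer[j].first) {
            out.push_back(older[i++]);
        } else if (newer[j].first < older[i].first) {
            out.push_back(newer[j++]);
        } else {
            out.push_back(newer[j++]);  // same key: keep the newer value
            ++i;                        // drop the stale entry from the older run
        }
    }
    while (i < older.size()) out.push_back(older[i++]);
    while (j < newer.size()) out.push_back(newer[j++]);
    return out;
}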

LSM-trees provide several advantages:

  1. Efficient write performance: The LSM-tree structure optimizes write operations by buffering writes in an in-memory component and reducing disk random accesses.
  2. Scalability: LSM-trees can efficiently handle large volumes of data and scale horizontally by adding more machines to the system.
  3. Crash recovery: The LSM-tree’s append-only design and use of SSTables make crash recovery simpler and faster by replaying the write-ahead logs.
  4. Flexibility: LSM-trees can be tuned to balance the trade-offs between write and read performance based on system requirements.

LSM-trees are widely used in various storage and database systems, including Apache Cassandra, LevelDB, RocksDB, and HBase, to name a few.

1 Introduction

A Skip List is a data structure that provides an efficient way to search, insert, and delete elements in a sorted list. It is similar to a linked list with multiple layers of forward pointers, allowing for faster search times compared to a simple linked list.

The Skip List consists of nodes arranged in levels, with the bottom level containing all the elements of the list. Each node contains a key-value pair, and the keys are kept in ascending order. In addition to the pointer to the next node on the bottom level, a node that is promoted to higher levels carries one forward pointer per level; because higher levels contain fewer nodes, these pointers skip over several nodes at a time and point further ahead in the list.

The forward pointers create a “skipping” effect, enabling efficient search operations. When searching for a specific key, the search starts at the topmost level and moves forward as long as the next node’s key is smaller than the target key. It then drops down one level and continues in the same way until it reaches the bottom level, where the node that follows the current position (if any) is compared against the target.
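The following sketch shows this search on a minimal skip-list node (the Node struct and contains function are illustrative, not a full implementation; insertion and deletion are omitted):

#include <vector>

// A minimal skip-list node: one key and one forward pointer per level it appears on
// (forward[0] is the bottom level, which links every node).
struct Node {
    int key;
    std::vector<Node*> forward;
    Node(int k, int levels) : key(k), forward(levels, nullptr) {}
};

// Start at the highest level of the head node, move forward while the next key is
// still smaller than the target, then drop down a level and repeat.
bool contains(const Node* head, int levels, int target) {
    const Node* cur = head;
    for (int level = levels - 1; level >= 0; --level) {
        while (cur->forward[level] != nullptr && cur->forward[level]->key < target) {
            cur = cur->forward[level];
        }
    }
    const Node* candidate = cur->forward[0];  // first node with key >= target, if any
    return candidate != nullptr && candidate->key == target;
}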

The probability of including a node in a higher level is determined by a randomized process. Typically, nodes are included in higher levels with a decreasing probability, resulting in a pyramid-like structure of levels.
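A common way to implement that randomized promotion is a sequence of fair coin flips, so that roughly half of the nodes reach level 2, a quarter reach level 3, and so on. The random_level helper below is one possible sketch (the 1/2 probability and the level cap are tunable choices, not fixed parts of the data structure):

#include <random>

int random_level(std::mt19937& rng, int max_level) {
    std::bernoulli_distribution coin(0.5);  // promote to the next level with probability 1/2
    int level = 1;
    while (level < max_level && coin(rng)) {
        ++level;
    }
    return level;
}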

The benefits of Skip List include:

  1. Fast search operations: With multiple levels and skipping pointers, the Skip List provides efficient search times similar to balanced search trees.
  2. Simple implementation: Skip Lists are easier to implement compared to other balanced search tree data structures such as AVL trees or red-black trees.
  3. Dynamic structure: Skip Lists can easily handle insertions and deletions without the need for rebalancing operations, making them suitable for dynamic data sets.
  4. Space efficiency: The space complexity of a Skip List is proportional to the number of elements, making it a viable choice for memory-constrained environments.

Skip Lists are commonly used in various applications, including databases, concurrent data structures, and randomized algorithms that require efficient search operations.

Level 3:          Head ----------------------------------------> 50 ----------------------------------------> NULL
                  |                                               |
Level 2:          Head -------------> 20 ----------------------> 50 -------------> 70 ----------------------> NULL
                  |                   |                           |                 |
Level 1 (bottom): Head ----> 10 ----> 20 ----> 30 ----> 40 ----> 50 ----> 60 ----> 70 ----> 80 ----> 90 ----> NULL