A high level introduction to FibyTree

Maxim Zaks

FibyTree is an optionally fully balanced, complete, implicit binary search tree with the same semantics as a sorted set. I wrote a reference implementation of this data structure in Mojo, but fear not if you have no idea about Mojo: this article stays very high level. So without further ado, let's start from the beginning.

What is a tree?

A tree is a graph in which one node has 0 incoming edges and every other node has exactly one. The node with 0 incoming edges is called the root of the tree.

What is a binary tree?

A binary tree is a tree where every node has at most 2 outgoing edges.

Normally we define a binary search tree as follows:

class Node:
    def __init__(self):
        self.left = None
        self.right = None

Where the two outgoing edges are represented with left and right fields. These fields are either set to None or point to another Node instance.

This example is a typical reference-based tree data structure. As a Node instance stores two pointers, it occupies 16 bytes on a machine with a 64-bit address space. The left and right fields point to Node instances that can live at arbitrary addresses, so this representation does not guarantee data locality and will likely incur CPU cache misses during traversal.

This leads us to the idea of implicit, or space-efficient, data structures. Here is how FibyTree stores its outgoing edges:

struct FibyTree:
    var left: DynamicVector[UInt16]
    var right: DynamicVector[UInt16]

    fn __init__(inout self):
        self.left = DynamicVector[UInt16]()
        self.right = DynamicVector[UInt16]()

In FibyTree we represent the edges as indices. We decided to use an unsigned 16-bit int as an index, meaning that our tree can have at most 2¹⁶ = 65,536 nodes. It also means that our overhead is only 4 bytes per node. When the left and right vectors are empty, we have an empty tree. The root node is always placed at index 0. If a node does not have children, it stores its own index. So for example:

left: [0]
right: [0]

Means that we have only one root node in the tree.

left: [1, 1]
right: [0, 1]

Represents a tree with a root and only a left child.

left: [0, 1]
right: [1, 1]

Is the opposite case: a root with only a right child.

left: [1, 1, 2]
right: [2, 1, 2]

Is a tree where the root node has both a left and a right child.

left: [2, 1, 2]
right: [1, 1, 2]

Is logically equivalent to the previous tree; the only difference is that we flipped the child indices, so the left child is now at index 2 and the right child is at index 1.

left: [1, 2, 3, 3]
right: [0, 1, 2, 3]

Is a tree of 4 nodes where every node has only a left child.

BTW, representing 4 nodes this way takes 16 bytes, the same amount of memory it takes to represent just one node in the reference-based tree.
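
To make the encoding concrete, here is a minimal Python sketch (my own illustration, not the Mojo implementation) that interprets a pair of left / right index vectors under the rules above: the root sits at index 0 and a missing child is encoded by a node storing its own index.

# A minimal Python sketch, assuming the encoding described above.

def children(left, right, i):
    # Return (left_child, right_child) of node i, with None for a missing child.
    l = left[i] if left[i] != i else None
    r = right[i] if right[i] != i else None
    return l, r

def depth(left, right, i=0):
    # Depth of the subtree rooted at node i.
    l, r = children(left, right, i)
    d_l = depth(left, right, l) if l is not None else 0
    d_r = depth(left, right, r) if r is not None else 0
    return 1 + max(d_l, d_r)

# The "root with only a left child" example from above:
left = [1, 1]
right = [0, 1]
print(children(left, right, 0))  # (1, None)
print(depth(left, right))        # 2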

The way we currently represent nodes is kind of pointless, as the nodes do not carry any data; they just encode the tree structure. In order for a node to represent data, we need to associate data with the node:

class Node:
    def __init__(self, data):
        self.data = data
        self.left = None
        self.right = None

In the Node class, the data property is another pointer, which points to an arbitrary location in memory and brings the size of a Node instance to 24 bytes. It is important to notice that this size is pure overhead for arranging the data as a tree; the size of the data itself is unknown.

struct FibyTree[T: AnyType]:
    var elements: DynamicVector[T]
    var left: DynamicVector[UInt16]
    var right: DynamicVector[UInt16]

    fn __init__(inout self):
        self.elements = DynamicVector[T]()
        self.left = DynamicVector[UInt16]()
        self.right = DynamicVector[UInt16]()

In order to associate data with a FibyTree, we introduce an elements field, which is a vector of generic type T. This way FibyTree owns the data, the data is stored in contiguous memory, and there is no per-node pointer overhead for it.
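
As an illustration (my own example, not taken from the reference implementation), here is how the three parallel vectors could line up for a tree storing the integers 8, 3 and 12, with 8 at the root; the left / right vectors are exactly the "root with a left and a right child" shape shown earlier:

# Hypothetical concrete instance of the three parallel vectors:
#
#        8        index 0
#       / \
#      3   12     indices 1 and 2
elements = [8, 3, 12]
left     = [1, 1, 2]  # root's left child is node 1; nodes 1 and 2 are leaves
right    = [2, 1, 2]  # root's right child is node 2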

Let's talk about binary search trees now

A binary search tree is a binary tree which stores comparable elements in such a way that the left child of a node contains data smaller than the node's data, and the right child contains data bigger than it. This way, by traversing the tree we can decide which branch to take in order to find the node with the desired element. Because of this property, a binary search tree can be seen as a sorted set: when we add an element which is not represented by a node yet, a new node is created as a child of the leaf node where we unsuccessfully finished our search (see the sketch below). I will not go into further detail about the properties of binary search trees, as there is enough material online.
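
Here is a rough Python sketch of how search and insertion can work on top of the index vectors and a comparator. It is only meant to illustrate the idea; the actual Mojo implementation (for example its handling of the deleted counter) differs.

# A sketch, assuming the encoding above and a cmp(a, b) comparator
# returning a negative, zero or positive number.

def contains(elements, left, right, value, cmp):
    if not elements:
        return False
    i = 0
    while True:
        c = cmp(value, elements[i])
        if c == 0:
            return True
        nxt = left[i] if c < 0 else right[i]
        if nxt == i:              # node stores its own index: no child on this side
            return False
        i = nxt

def add(elements, left, right, value, cmp):
    new = len(elements)
    if not elements:              # first element becomes the root at index 0
        elements.append(value); left.append(new); right.append(new)
        return
    i = 0
    while True:
        c = cmp(value, elements[i])
        if c == 0:
            return                # already present: a set keeps a single copy
        child = left if c < 0 else right
        if child[i] == i:         # the search ends here: attach the new node
            child[i] = new
            elements.append(value); left.append(new); right.append(new)
            return
        i = child[i]

cmp = lambda a, b: a - b
elements, left, right = [], [], []
for v in [8, 3, 12]:
    add(elements, left, right, v, cmp)
print(contains(elements, left, right, 12, cmp))  # True
print(elements, left, right)                     # [8, 3, 12] [1, 1, 2] [2, 1, 2]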

In order to make FibyTree a binary search tree and a semantic equivalent of a sorted set, we provide the following API:

struct FibyTree[T: AnyType, cmp: fn(T, T)->Int]:
    var elements: DynamicVector[T]
    var left: DynamicVector[UInt16]
    var right: DynamicVector[UInt16]
    var deleted: Int

    fn __init__(inout self):
        self.elements = DynamicVector[T]()
        self.left = DynamicVector[UInt16]()
        self.right = DynamicVector[UInt16]()
        self.deleted = 0

    fn add(inout self, element: T):
    fn delete(inout self, element: T) -> Bool:
    fn sorted_elements(self) -> UnsafeFixedVector[T]:
    fn clear(inout self):
    fn union(self, other: Self) -> Self:
    fn intersection(self, other: Self) -> Self:
    fn difference(self, other: Self) -> Self:
    fn symetric_difference(self, other: Self) -> Self:
    fn is_subset(self, other: Self) -> Bool:
    fn is_superset(self, other: Self) -> Bool:
    fn is_disjoint(self, other: Self) -> Bool:
    fn __len__(self) -> Int:
    fn __contains__(self, element: T) -> Bool:

With this API we can add, delete, and search for elements in the FibyTree. We can return all elements as a sorted vector, and produce a new tree which is the union, intersection, difference, or symmetric difference of two FibyTrees. We can check whether one tree is a subset of, a superset of, or disjoint from another. We can get the length of the tree and also clear it.
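
For intuition, the API corresponds closely to Python's built-in set; the equivalent Python operations look like this (with the addition that FibyTree keeps its elements ordered, so sorted_elements() corresponds to sorted(a)):

a = {1, 3, 5, 7}
b = {3, 4, 5}
print(sorted(a))                 # sorted_elements()        -> [1, 3, 5, 7]
print(a | b)                     # union                    -> {1, 3, 4, 5, 7}
print(a & b)                     # intersection             -> {3, 5}
print(a - b)                     # difference               -> {1, 7}
print(a ^ b)                     # symmetric difference     -> {1, 4, 7}
print(a >= b, a.isdisjoint(b))   # is_superset, is_disjoint -> False False
print(len(a), 3 in a)            # __len__, __contains__    -> 4 True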

Let's talk about the last but not least feature of FibyTree: we can turn it into a complete, fully balanced binary search tree.

What does it mean though?

When we add elements to a binary search tree, we add new nodes at the bottom, which can lead to tree degradation. In the worst case we add sorted elements one after another, which leads to a single-branch tree, basically a linked list. Balancing the tree means that we strive to minimise the difference between the depths of subtrees. A complete binary tree is a binary tree where each level is fully filled except for the last one; a good example of a complete binary tree is a heap. A complete, fully balanced binary search tree guarantees a depth of ceil(log2(n + 1)) and therefore at most ceil(log2(n + 1)) comparisons for a search.
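
To put some numbers on that bound, here is a quick check of ceil(log2(n + 1)) for a few tree sizes, including the maximum of 65,536 nodes a FibyTree can hold:

import math

for n in [1, 3, 7, 1000, 65536]:
    print(n, math.ceil(math.log2(n + 1)))
# 1 -> 1, 3 -> 2, 7 -> 3, 1000 -> 10, 65536 -> 17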

In order to balance the FibyTree, we sort the elements and the left / right indices into Eytzinger order (BTW, if you were wondering, this is where FibyTree gets its y from), which has O(n) runtime complexity. When a FibyTree is fully balanced we don't even need to look up the left and right children by index; we can use the Eytzinger formula:

left = (n + 1) * 2 - 1
right = (n + 1) * 2

to find the left / right child of the node at index n.
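
Here is a small Python sketch of that idea: writing sorted elements into Eytzinger order with a single in-order pass, and then searching with the child formula instead of stored indices. It is my own illustration of the layout, not the Mojo code.

def to_eytzinger(sorted_elements):
    # Place sorted elements so that the children of the node at index n
    # sit at (n + 1) * 2 - 1 and (n + 1) * 2. The in-order walk over the
    # implicit complete tree consumes the sorted input in one O(n) pass.
    n = len(sorted_elements)
    out = [None] * n
    it = iter(sorted_elements)

    def fill(i):
        if i >= n:
            return
        fill((i + 1) * 2 - 1)    # left child
        out[i] = next(it)
        fill((i + 1) * 2)        # right child

    fill(0)
    return out

def search(eytz, value):
    # Branch using the Eytzinger formula instead of the left / right vectors.
    i = 0
    while i < len(eytz):
        if value == eytz[i]:
            return True
        i = (i + 1) * 2 - 1 if value < eytz[i] else (i + 1) * 2
    return False

layout = to_eytzinger([1, 2, 3, 4, 5, 6, 7])
print(layout)             # [4, 2, 6, 1, 3, 5, 7]
print(search(layout, 5))  # True
print(search(layout, 8))  # False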

This article is only a high-level introduction. I will write further articles discussing performance benchmarks and implementation details.

You can find the very early (not fully optimised and with some copy-paste code) but already complete implementation of FibyTree in Mojo below.

Thank you for reading.

Please leave a 👏 if you liked it.
