Heap structure and heap sort algorithm

Heap data structure

Definition.
A tree is heap-ordered if the key in each node is larger than or equal to the keys in all of that node's children (if any). Equivalently, the key in each node of a heap-ordered tree is smaller than or equal to the key in that node's children (if any).
The most important property of a heap-ordered tree is that the root always contains the largest (smallest) element.
A heap-ordered tree.

The tree on the picture above is a complete binary tree, and in this case we can use an array to store all information about the nodes. To keep the complete information about a tree in an array we will enumerate all the vertices starting from the root, then sequentially nodes of each level from left to right as shown below.

A heap-ordered tree mapped to an array.
Definition.
A heap is a set of nodes with keys arranged in a complete heap-ordered binary tree, represented as an array.
We need to distinguish max-heaps (the root of the tree contains the largest element) and min-heaps (the root of the tree contain the smallest element).

It is not difficult to see the rules for enumeration of a parent node and its children in heaps. If a node has index n in the array, then its left child has index 2n+1 and its right child has index 2n+2. Accordingly, its parent has index [(n-1)/2].

We can define the following functions that will be useful in the future:
int parent(int n)  {  return (n-1)/2;  }
int left_child(int n)  {  return 2*n+1;  }
int right_child(int n) {  return 2*n+2;  }

Building a heap

The next question we are about to discuss is how to build a heap. Let's assume that we have the following tree:
Not a heap at all.
Obviously, this is not a heap-ordered tree because the red numbers violate the restriction. To construct a heap-ordered tree out of this one we will use a simple idea. We will design a simple procedure heapify that looks at a node and its children, finds the biggest element out of these three and puts it in this small tree. For example, after looking at picture A below this procedure should find the value 11 and swaps it with the value 8 (as shown on picture B):
A. Before.
B. After.
If the procedure works under the assumption that the both children are roots of heap-ordered subtrees, then this procedure should recursively call itself for the child that had the largest value (the one moved up). For example, after using this procedure against the node containing the value 8 we will have
Thus, our heapify function applied to an array of the size heap_size can look like
void max_heapify(int data[], int heap_size, int node)
{
   int l = left_child(node),
       r = right_child(node);
   if( l >= heap_size ) return; // this node does not have children
   int max_ind =  data[node]<data[l] ? l : node;
   if( r<heap_size && data[max_ind]<data[r] )
      max_ind = r;
   // if the largest value belongs to one of the children then ...
   if( max_ind != node ){
      exch(data[node], data[max_ind]);       // ... exchange it with the node ...
      max_heapify(data, heap_size, max_ind); // ... and make sure this child's subtree is still OK
   }
}
where data is an array we are trying to convert to the heap, heap_size is the size of the heap, and node is the index of the node we apply this function to.

This function will successfully work only if both children of a tested node are the roots of "good" subtrees (that is, those subtrees are heap-ordered). To satisfy this condition while building a heap out of an array, we need to start from the nodes of lowest level. More exactly, we can start with the nodes whose children are terminal nodes (that is, do not have any children by themselves). For example, on the pictures above we can start with the nodes with indexes 5, 4, 3 and go up to the root. Thus, the implementation of the build_max_heap function is very simple

void build_max_heap(int data[], int N)
{
   for(int node=N/2-1; node>=0; node--)
      max_heapify(data, N, node);
}
where data is an array to be converted, and N is th size of the array. Let us note that for future use we distinguish the value N (the size of the array data) and value heap_size (the number of elements in the heap stored within array data).

Heap sort

The heap is a very important data structure and widely used in programming. It also can be used for sorting arrays. The idea of sorting using the heaps is based on the
main property of the heaps. Assume we have an array data of the size N, that was converted to a heap. Thus, the size of the heap (heap_size) is also equal to N. If we would like to sort this array, then we can use the property that the root (or the element with index zero) always contain the largest element. Since the right place for the largest element is at the very end of the array, we need to swap the last element and the first one. By doing this we, of course, ruined the heap, but we can apply the max_hepify procedure to the root node and restore the heap-ordered structure. Although, this restoration will bring the biggest element (the one we put at the end) to its original place at the beginning of the array, to avoid this, we need to decrease the value of the heap_size variable.

After restoring the heap structure on a sub-array data [0, N-2], we know that the second largest element now takes the first position in the array. So, we swap it with the element #N-2 and again restore the heap-order by using max_heapify(data, --heapsize, 0).

Keep repeating such steps until we put all elements in their right positions.


Glossary

Definition.
A tree is a nonempty collection of edges and nodes that has exactly one path connecting any two nodes.
Definition.
A node (or vertex) is simple object that can have a name and can carry other associated information.
Definition.
An edge is a connection between two nodes.
Definition.
A path in a tree is a list of distinct nodes in which successive nodes are connected by edges in the tree.
Definition.
A binary tree is an ordered (means we distinguish the children of a node) tree consisting of two types of nodes:
Definition.
A complete binary tree is one with all levels filled, except possibly the final one, which is filled from left to right.
Definition.
Nodes with no children are called leaves, or terminal nodes.