The third part of the CLRS textbook, Data Structures, begins the talk about data structures and their applications. It is a very important part of the book, and should be read and practiced with the utmost care. This part covers chapters 10 to 14, and goes from page 197 to 318.
Chapter 10: Elementary Data Structures
In this chapter, elementary data structures and conventions which will be used throughout the book are introduced. A quick look at the chapter yields familiarity with stacks and queues (section 10.1), linked lists (section 10.2), and rooted trees (section 10.4). Also the notions used in the book for object and pointer implementation is introduced.
Chapter 11: Hash Tables
Hashing is a very frequently used concept in modern programming and software design. Hashing comes in handy when you want to store a lot of data and be able to do dictionary operations on them – namely, insertion, deletion, and retrieval (or searching). With a small amount of data, the use of arrays is quite reasonable and efficient. But when the data in question exceeds the limits of your available memory, it is quite different.
When we use the term direct-address tables, we are referring to plain, old arrays. In other words, to gain access to the place in the table of data where we have stored our information, we use function f(x)=x. However, suppose we have only six individual files we want to store, which have keys 100, 200, 300, 400, 500, and 600. So , if we want to store the sixth one, our array would reach a size of 600. Wouldn’t if be better if we stored each data in a cell with label key/100, so that our files would be saved in cells 1, 2, 3, 4, 5, and 6? This is exactly what hashing is all about.
While hashing, we use a certain quantitative characteristics of the data, and by means of hash function “h” we derive the destination. However, when using a hash function, we face the problem of two different keys hashing into the same spot. We call this situation a collision.
Collisions can be resolved by a method named chaining. When we use chaining, we are basically storing a linked list of data in each cell of our hash table. If our hash table has “m” slots, and we want to store “n” number of data, we call the ratio a=n/m the “load factor” for our hash table.
As theorems 11.1 and 11.2 state, in a hash table where collisions are resolved by chaining, a search in the hash table – whether successful or not – takes expected time where a is the load factor.
Creating a good hash function
A good hash function fulfills the assumption of simple uniform hashing: “each key is equally likely to hash to any of the m slots, independently of where any other key has hashed to” (page 229).
There are many ways to implement a relatively good hash function. The major methods include:
- The division method:
- The multiplication method:
- Universal hashing; which includes implementing several different hashing functions and then arbitrating between them by means of an intermediate function (pgs. 232-237)
Open Addressing
Open addressing, means never allowing the load factor to exceed one. That is, if a collision occurs, the colliding data is discarded. It is rather radical, but it’s quite practical.
In open addressing, we probe the table until we find an empty slot. The sequence of slots being probed depends on the key being hashed. Therefore, our hash function must accept the probe number as an input. The order in which the table is probed for a certain key, is called a probe sequence.
In probing the table, we make the assumption of uniform hashing, that “each key is equally likely to have any of the m! permutations of <0, 1, … , m-1>” (page 239).
We have many methods of generating a probe sequence, some of which being:
- Linear probing:
- Quadratic probing:
- Double hashing:
By theorem 11.6, the expected number of probes in an unsuccessful search in open addressing is , which is the same as the number of probes for insertion.
A successful search on an open-address hash table requires at most (theorem 11.8).
Chapter 12: Binary Search Trees
A binary search tree, is a binary tree, in which for every node A we can write: .
By theorem 12.1, we know that an in-order tree walk of the binary search tree takes time, “n” being the number of nodes.
Important operations on this data structure are: search, maximum, minimum, successor (the first node with the smallest key greater than the given key), insertion, and deletion.
Chapter 13: Red-Black Trees
A red-black tree is a binary search tree with one added property color for each node, which is either red or black. This data structure also forces the following constraints on the underlying binary search tree:
- Every node is either red or black
- The root is black
- Every leaf* is black
- If a node is red, then both its children are black
- For each node, all paths from that node to descendant leaves contain the same number of black nodes
*The leaves of the red-black tree are all NIL object (or one sentinel NIL object, for memory usage optimization).
By property 4, we can easily show that in a red-black tree, the longest path from the root to a leaf is at most twice as long as the shortest path between the root and a leaf node. This property ensures that the binary tree follows a relatively well-defined and balanced distribution.
When inserting data in the tree or otherwise modifying it, we must maintain the red-black properties defined above. This feat can be achieved by means of the LEFT-ROTATE and RIGHT-ROTATE functions introduced in the textbook (page 278).
Inserting in a red-black tree can be done in time.
Chapter 14: Augmenting Data Structures
In this chapter, modifying an existing data structure to suite our own needs is discussed. This process, which is called augmenting a data structure, can be done by – roughly – following the steps below:
- Choosing an underlying data structure
- Determining additional information to be maintained in the underlying data structure
- Verifying the maintenance of this information while performing basic operations
- Developing our own new operations
By way of example, the text book introduces dynamic order statistics problem and the interval tree, both of which are implemented using a red-black tree. To simplify step three, by using theorem 14.1 (page 309) we can show that if the contents of node x can be computed only using the information stored in x, left[x], and right[x] – including f(left[x]) and f(right[x]) – by the application of field “f” which augments the red-black tree, those values will be maintained during insertion and deletion without asymptotically affecting the running time of these operations.