Let $S$ be a string of length $N$. The $i^{th}$ suffix of $S$ is substring $S[i \ldots n - 1], i = 0 \ldots n - 1$.

As a data structure, it is widely used in areas such as data compression, bioinformatics and, in general, in any area that deals with strings and string matching problems, so, as you can see, it is of great importance to know efficient algorithms to construct a suffix array for a given string.

A **Suffix Array** will contain integers that represent the **starting indexes** of the all the suffixes of a given string, after the aforementioned suffixes are sorted.

Let $S\ =\ abaab$

All suffixes are as follows

$0.\ abaab$

$1.\ baab$

$2.\ aab$

$3.\ ab$

$4.\ b$

After sorting these strings:

$2.\ aab$

$3.\ ab$

$0.\ abaab$

$4.\ b$

$1.\ baaab$

Suffix Array for $S$ will be $(2,\ 3,\ 0,\ 4,\ 1)$.

This is the most naive approach. Get all the suffixes and sort them using Quicksort or Mergesort simultaneously retaining their original indexes. Sorting takes $O(N\log{N})$ so how it is $O(N^2\log{N})$??. What we miss here is comparison of two strings is not $O(1)$ in worst case it takes $O(N)$. So the final complexity is $O(N^2\log{N})$. We can reduce this comparison from $O(N)$ to $O(1)$. Read further :).

We can reduce comparison of two strings from $O(N)$ to $O(1)$ using the fact that, given strings are not
random strings they are part of single string. **Each string has something in common with others.**

Let's see how we can use this fact.
Let's sort the suffixes on basis of their first character and assign them rank.

If two are equal rank for them will be same.

$0.\ a|baab$

$0.\ a|ab$

$0.\ a|b$

$1.\ b|aab$

$1.\ b|$

**Note:** '|' denotes sorting is based on left part of '|'.

Now double the characters to take from each for sorting i.e. 2.

When we take string of two chars we can have two parts first containing 1 char other containing 1.

Let's compare $abaab$ with $baab$, based on the first part, that of 1 character we can say that $abaab$ will

be always ranked above $baab$ so skip further comparison.

Now compare $abaab$ with $aab$ based on first part both have same rank. Now we will compare their second half part,

second half part of $abaab$ is only $b$ and for $aab$ be is $a$ for these we already know their ranks

for $b$ (i.e. whole $baab$) is 1 and $a$ (i.e. whole $ab$) is 0.

Hence $abaab$ will be ranked above $baab$.

For the strings not having second part we will rank their second part highest i.e. -1 for example $b$ is not having $2^{nd}$ char so its rank tuple will be (1, -1).

$0.\ aa|b$

$1.\ ab|aab$

$1.\ ab|$

$2.\ b|$

$2.\ ba|ab$

Now, in the next iteration, we sort 4-characters strings. This involves a lot of comparisons between different 4-characters strings.

How do we compare two 4-characters strings? Well, we could compare them character by character. That would be up to 4 operations per comparison.

But instead, we compare them by looking up the ranks of the two characters contained in them, using the rank table generated in the previous steps.

That rank represents the lexicographic rank from the previous 2-charater sort, so if any given 4-character string has a higher rank than another 4-character string, then it must be lexicographically greater somewhere in the first two characters.

Hence, if for two 4-characters string the rank is identical, they must be identical in the first two characters.

In other words, two look-ups in the rank table are sufficient to compare all 4 characters of the two 4-characters strings.

Similarly we can compare 8, 16, 32 $\ldots$ strings in at most two integer comparisons.

i.e. $O(1)$ comparison.

Following fully commented code will clarify more.

C++ implementation Show/Hide

```
#include bits/stdc++.h
using namespace std;
// suffixRank is table hold the rank of each string on each iteration
// suffixRank[i][j] denotes rank of jth suffix at ith iteration
int suffixRank[20][int(1E6)];
// Example "abaab"
// Suffix Array for this (2, 3, 0, 4, 1)
// Create a tuple to store rank for each suffix
struct myTuple {
int originalIndex; // stores original index of suffix
int firstHalf; // store rank for first half of suffix
int secondHalf; // store rank for second half of suffix
};
// function to compare two suffix in O(1)
// first it checks whether first half chars of 'a' are equal to first half chars of b
// if they compare second half
// else compare decide on rank of first half
int cmp(myTuple a, myTuple b) {
if(a.firstHalf == b.firstHalf) return a.secondHalf < b.secondHalf;
else return a.firstHalf < b.firstHalf;
}
int main() {
// Take input string
// initialize size of string as N
string s; cin >> s;
int N = s.size();
// Initialize suffix ranking on the basis of only single character
// for single character ranks will be 'a' = 0, 'b' = 1, 'c' = 2 ... 'z' = 25
for(int i = 0; i < N; ++i)
suffixRank[0][i] = s[i] - 'a';
// Create a tuple array for each suffix
myTuple L[N];
// Iterate log(n) times i.e. till when all the suffixes are sorted
// 'stp' keeps the track of number of iteration
// 'cnt' store length of suffix which is going to be compared
// On each iteration we initialize tuple for each suffix array
// with values computed from previous iteration
for(int cnt = 1, stp = 1; cnt < N; cnt *= 2, ++stp) {
for(int i = 0; i < N; ++i) {
L[i].firstHalf = suffixRank[stp - 1][i];
L[i].secondHalf = i + cnt < N ? suffixRank[stp - 1][i + cnt] : -1;
L[i].originalIndex = i;
}
// On the basis of tuples obtained sort the tuple array
sort(L, L + N, cmp);
// Initialize rank for rank 0 suffix after sorting to its original index
// in suffixRank array
suffixRank[stp][L[0].originalIndex] = 0;
for(int i = 1, currRank = 0; i < N; ++i) {
// compare ith ranked suffix ( after sorting ) to (i - 1)th ranked suffix
// if they are equal till now assign same rank to ith as that of (i - 1)th
// else rank for ith will be currRank ( i.e. rank of (i - 1)th ) plus 1, i.e ( currRank + 1 )
if(L[i - 1].firstHalf != L[i].firstHalf || L[i - 1].secondHalf != L[i].secondHalf)
++currRank;
suffixRank[stp][L[i].originalIndex] = currRank;
}
}
// Print suffix array
for(int i = 0; i < N; ++i) cout << L[i].originalIndex << endl;
return 0;
}
```

- Uva 760 - DNA Sequencing
- Uva 1223 - Editor
- Codechef - Tandem
- Codechef - Substrings and Repetitions
- Codechef - Entangled Strings
- Codeforces - Martian Strings
- Codeforces - Little Elephant and Strings
- SPOJ - Ada and Terramorphing
- SPOJ - Ada and Substring
- UVA - 1227 - The longest constant gene
- SPOJ - Longest Common Substring
- UVA 11512 - GATTACA
- LA 7502 - Suffixes and Palindromes
- GYM - Por Costel and the Censorship Committee
- UVA 1254 - Top 10
- UVA 12191 - File Recover
- UVA 12206 - Stammering Aliens
- Codechef - Jarvis and LCP
- LA 3943 - Liking's Letter
- UVA 11107 - Life Forms
- UVA 12974 - Exquisite Strings
- UVA 10526 - Intellectual Property
- UVA 12338 - Anti-Rhyme Pairs
- DevSkills Reconstructing Blue Print of Life
- UVA 12191 - File Recover
- SPOJ - Suffix Array
- LA 4513 - Stammering Aliens
- SPOJ - LCS2
- Codeforces - Fake News (hard)
- SPOJ - Longest Commong Substring
- SPOJ - Lexicographical Substring Search
- Codeforces - Forbidden Indices
- Codeforces - Tricky and Clever Password
- LA 6856 - Circle of digits

**Upcoming: $O(N\log(N))$ approach, $O(N)$ approach and LCP array**