Algorithms, 4th Edition (algs4): C++ implementations of the string sorting algorithms

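1. Key-indexed counting
A simple sort for keys that are small integers.
It is stable: a sorting algorithm is stable if it preserves the relative order of elements with equal keys.
It beats the N log N lower bound, which applies only to comparison-based sorts. Key-indexed counting never compares keys; it only reads them and uses them as array indices, so sorting N items whose keys lie in [0, R) takes time proportional to N + R.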

#include <vector>
#include <string>
#include <iostream>

struct info {
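    // Delegate to the two-argument constructor. Calling info(0, "") inside the body
    // would only build and discard a temporary, leaving key uninitialized.
    info() : info(0, "") {}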
    info(int k, std::string v): key(k), val(v) {}
    int key;
    std::string val;
};

void printVector(const std::vector<info>& vec) {
    for (const auto &item : vec)
        std::cout << item.key << " " << item.val << std::endl;
    std::cout << std::endl;
}

int main() {
    std::vector<info> a{info(2, "Anderson"), info(3, "Brown"), info(3, "Davis"), info(4, "Garcia"), info(1, "Harris"), info(3, "Jackson")};
    std::cout << "before sorting: " << std::endl;
    printVector(a);

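    int R = 5;                      // keys are integers in [0, R)
    int N = a.size();
    std::vector<info> aux(N);
    std::vector<int> count(R+1);    // value-initialized to 0
    // 1. Count how often each key occurs (stored one slot to the right).
    for (int i = 0; i < N; ++i)
        count[a[i].key + 1]++;
    // 2. Turn the counts into starting indices: count[r] = number of keys less than r.
    for (int r = 0; r < R; r++)
        count[r+1] += count[r];
    // 3. Distribute the items into aux; scanning left to right keeps equal keys in order.
    for (int i = 0; i < N; ++i)
        aux[count[a[i].key]++] = a[i];
    // 4. Copy the sorted result back.
    for (int i = 0; i < N; ++i)
        a[i] = aux[i];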
    
    std::cout << "after sorting: " << std::endl;
    printVector(a);
}
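
The four loops above can also be packaged as a reusable function. The following is only a sketch under some assumptions: the names `Record` and `keyIndexedCount` are made up for this illustration (`Record` mirrors the `info` struct above), and the function returns a sorted copy instead of sorting in place.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record type for this sketch; mirrors the `info` struct above.
struct Record {
    int key;
    std::string val;
};

// Returns a stably sorted copy of `a`. Every key must lie in [0, R).
std::vector<Record> keyIndexedCount(const std::vector<Record>& a, int R) {
    std::vector<int> count(R + 1, 0);
    // Count frequencies, stored one slot to the right.
    for (const Record& rec : a) {
        assert(rec.key >= 0 && rec.key < R);
        count[rec.key + 1]++;
    }
    // Convert frequencies into starting indices.
    for (int r = 0; r < R; ++r)
        count[r + 1] += count[r];
    // Distribute; the left-to-right scan keeps equal keys in input order.
    std::vector<Record> aux(a.size());
    for (const Record& rec : a)
        aux[count[rec.key]++] = rec;
    return aux;
}
```

Called with the six records from `main` above and `R = 5`, the three records with key 3 come out in their input order (Brown, Davis, Jackson), which is the stability property the later algorithms rely on.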

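2. LSD (least-significant-digit first) string sort
Suited to applications where all string keys have the same length.
Basic idea: if every string has length W, sort the strings W times with key-indexed counting, using each character position as the key and moving from right to left.
This works only because key-indexed counting is stable: each pass preserves the order established by the earlier passes on the characters to its right.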

#include <iostream>
#include <vector>
#include <string>
#include <limits.h>

using namespace std;

void printVec(const vector<string> &a);
void LSDSort(vector<string> &a, int W);

int main() {
    vector<string> a;
    string s;
    while (cin >> s)
        a.push_back(s);
    cout << "before sorting:" << endl;
    printVec(a);

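    if (a.empty()) return 0;  // nothing to sort; a[0] would be out of range
    LSDSort(a, a[0].size());  // LSD assumes every key has the same length W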

    cout << "after sorting:" << endl;
    printVec(a);
}

void printVec(const vector<string> &a) {
    for (const auto &item : a)
        cout << item << " ";
    cout << endl;
}

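void LSDSort(vector<string> &a, int W) {
    int N = a.size();
    // Radix: one bucket per possible byte value. With R = CHAR_MAX the largest
    // character would index one past the end of count.
    int R = UCHAR_MAX + 1;
    vector<string> aux(N);

    // One stable key-indexed counting pass per character, from right to left.
    for (int d = W-1; d >= 0; d--) {
        vector<int> count(R+1);
        // The cast to unsigned char keeps the index non-negative for non-ASCII bytes.
        for (int i = 0; i < N; ++i)
            count[(unsigned char)a[i][d] + 1]++;
        for (int r = 0; r < R; ++r)
            count[r+1] += count[r];
        for (int i = 0; i < N; ++i)
            aux[count[(unsigned char)a[i][d]]++] = a[i];
        for (int i = 0; i < N; ++i)
            a[i] = aux[i];
    }
}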
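
To see why stability matters, it can help to look at the array after every pass. The sketch below is an illustration only: `lsdTrace` is a hypothetical helper, not part of the book's code, and it assumes byte-sized characters (R = 256) and keys that all have length W.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: runs LSD on a copy of `a` and records the array after
// each pass. passes[0] is the state after sorting on the last character.
std::vector<std::vector<std::string>> lsdTrace(std::vector<std::string> a, int W) {
    const int R = 256;  // assumption: one bucket per byte value
    const int N = static_cast<int>(a.size());
    std::vector<std::string> aux(N);
    std::vector<std::vector<std::string>> passes;
    for (int d = W - 1; d >= 0; --d) {
        std::vector<int> count(R + 1, 0);
        for (int i = 0; i < N; ++i) {
            assert(static_cast<int>(a[i].size()) == W);  // fixed-length keys only
            count[static_cast<unsigned char>(a[i][d]) + 1]++;
        }
        for (int r = 0; r < R; ++r)
            count[r + 1] += count[r];
        for (int i = 0; i < N; ++i)
            aux[count[static_cast<unsigned char>(a[i][d])]++] = a[i];
        a = aux;              // copy back
        passes.push_back(a);  // snapshot after this pass
    }
    return passes;
}
```

For the input cab, abc, bca, aab with W = 3, the snapshots are bca cab aab abc (after the last character), cab aab abc bca (after the middle character), and aab abc bca cab (after the first character). In the final pass aab stays ahead of abc only because the pass is stable and the previous pass had already put them in that order.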

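3. MSD (most-significant-digit first) string sort
Examines the characters of the keys from left to right.
Its appeal is that it does not necessarily have to examine all of the input to finish the sort.
Like quicksort, it partitions the array into independent pieces and sorts each subarray recursively with the same method. The difference is that MSD partitions on only the leading character of the key, whereas quicksort's comparisons can involve the whole key.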

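Roughly: sort all strings by their first character, then recursively sort the subarray belonging to each first character (ignoring that character).
Because the strings can have different lengths, reaching the end of a string needs special care. Strings whose characters have all been examined are placed at the front of their subarray. This is done with a charAt() function that returns -1 once the index is past the end of the string.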

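The things to watch for when using MSD come down to two points: 1. switch to insertion sort for small subarrays; 2. MSD is slow when many keys share a prefix (or, worse, when all keys are equal).
Note also that with a large alphabet MSD can be dangerous and may consume a great deal of time and space.

Small subarrays: handling them is especially important for MSD. Suppose you sort millions of distinct extended-ASCII strings (R=256) without any special treatment of small subarrays. Every string eventually ends up in a subarray of its own, so millions of subarrays of size 1 get sorted. Worse, each of those calls creates a count array of 258 entries, and that is where most of the cost goes. With Unicode (R=65536) the sort may be thousands of times slower. Switching to insertion sort for small subarrays is therefore a must for MSD. Cutting over for subarrays of length 10 or less can reduce the running time to about one tenth of what it was.

MSD is slow on subarrays that contain many equal keys: such subarrays rarely become small enough to trigger the switch to insertion sort, so every character of the equal keys gets examined. The worst case for MSD is when all keys are equal. Many keys with a long common prefix cause the same problem.

Space is also a concern, because the count array cannot be created outside the recursive method (unlike the aux array).

The performance of MSD depends mainly on the data. For random input, MSD examines only part of the characters (just enough to distinguish the strings) and the running time is sublinear. For nonrandom input it may still be sublinear, but more characters have to be examined. In the worst case, when all keys are equal, MSD examines every character of every key.

The main challenge in applying MSD is therefore coping with the nonrandomness in the data.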

#include <iostream>
#include <vector>
#include <string>
#include <limits.h>

using namespace std;

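vector<string> aux;
int R = UCHAR_MAX + 1;  // radix: one bucket per byte value (CHAR_MAX left no bucket for the largest character)
int M = 15;             // subarrays with at most M+1 elements are handed to insertion sort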

void printVec(const vector<string> &a);
void MSDSort(vector<string> &a, int lo, int hi, int d);
void insertionSort(vector<string> &a, int lo, int hi, int d);
bool smaller(string &v, string &w, int d);
int charAt(const string &s, int d);


int main() {
    vector<string> a;
    string s;
    while (cin >> s)
        a.push_back(s);
    cout << "before sorting:" << endl;
    printVec(a);

    aux = vector<string>(a.size());
    MSDSort(a, 0, a.size()-1, 0);

    cout << "after sorting:" << endl;
    printVec(a);
}

void printVec(const vector<string> &a) {
    for (const auto &item : a)
        cout << item << " ";
    cout << endl;
}

int charAt(const string &s, int d) {
    // Return the d-th character as a non-negative value, or -1 past the end of s.
    if (d < (int)s.size())
        return (unsigned char)s[d];
    else
        return -1;
}

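void MSDSort(vector<string> &a, int lo, int hi, int d) {
    // Small subarray: cut over to insertion sort.
    if (hi <= lo + M) {
        insertionSort(a, lo, hi, d);
        return;
    }
    // Count frequencies. charAt() returns -1 at the end of a string, so indices are
    // shifted by 2 and exhausted strings land at the front of the subarray.
    auto count = vector<int>(R+2, 0);
    for (int i = lo; i <= hi; ++i)
        count[charAt(a[i], d)+2]++;
    // Turn the counts into starting indices.
    for (int r = 0; r < R+1; r++)
        count[r+1] += count[r];
    // Distribute into aux, then copy back.
    for (int i = lo; i <= hi; ++i)
        aux[count[charAt(a[i], d)+1]++] = a[i];
    for (int i = lo; i <= hi; ++i)
        a[i] = aux[i-lo];

    // Sort the subarray of each character value on the next character.
    // The end-of-string group is skipped: its keys are all equal.
    for (int r = 0; r < R; r++)
        MSDSort(a, lo+count[r], lo+count[r+1]-1, d+1);
}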

void insertionSort(vector<string> &a, int lo, int hi, int d) {
    using std::swap;
    for (int i = lo; i <= hi; i++)
        for (int j = i; j > lo && smaller(a[j], a[j-1], d); --j)
            swap(a[j], a[j-1]);
}

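bool smaller(string &v, string &w, int d) {
    // Compare the suffixes starting at position d without allocating substrings.
    return v.compare(d, string::npos, w, d, string::npos) < 0;
}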
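
The data dependence described in the notes above can be made concrete by counting how many character positions MSD inspects. This is a rough sketch and not the implementation above: `msdNoCutoff`, `charAtCounted`, `countExamined`, and the global `examined` counter are names invented for the illustration, the insertion-sort cutoff is left out on purpose so that only the radix passes are counted, and each end-of-string probe is counted as one examined position.

```cpp
#include <cassert>
#include <string>
#include <vector>

const int R = 256;   // assumption: one bucket per byte value
long examined = 0;   // hypothetical counter of inspected character positions

// Same contract as charAt() above, but counts every call.
int charAtCounted(const std::string& s, int d) {
    assert(d >= 0);
    ++examined;
    return d < static_cast<int>(s.size()) ? static_cast<unsigned char>(s[d]) : -1;
}

// MSD without the insertion-sort cutoff, so only radix passes are measured.
void msdNoCutoff(std::vector<std::string>& a, std::vector<std::string>& aux,
                 int lo, int hi, int d) {
    if (hi <= lo) return;
    std::vector<int> count(R + 2, 0);
    std::vector<int> key(hi - lo + 1);  // cache keys: each position is read once
    for (int i = lo; i <= hi; ++i) {
        key[i - lo] = charAtCounted(a[i], d);
        count[key[i - lo] + 2]++;
    }
    for (int r = 0; r < R + 1; ++r)
        count[r + 1] += count[r];
    for (int i = lo; i <= hi; ++i)
        aux[count[key[i - lo] + 1]++] = a[i];
    for (int i = lo; i <= hi; ++i)
        a[i] = aux[i - lo];
    for (int r = 0; r < R; ++r)
        msdNoCutoff(a, aux, lo + count[r], lo + count[r + 1] - 1, d + 1);
}

// Sorts a copy of `a` and returns how many positions were inspected.
long countExamined(std::vector<std::string> a) {
    std::vector<std::string> aux(a.size());
    examined = 0;
    msdNoCutoff(a, aux, 0, static_cast<int>(a.size()) - 1, 0);
    return examined;
}
```

With four keys that differ in their first character (abcd, bcda, cdab, dabc) this sketch inspects 4 positions. With four equal keys (aaaa four times) it inspects 20: all 16 characters plus one end-of-string probe per key. That is the gap between the best case and the all-keys-equal worst case described above.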

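4. Three-way string quicksort
Partition the array three ways on the leading character of the keys, and move on to the next character only when recursing on the middle subarray (the keys whose leading character equals the partitioning character).
It is a hybrid of quicksort and MSD.
Its advantage over MSD is that it copes well with equal keys, keys with long common prefixes, keys drawn from a small range, and small arrays, which are all the cases MSD handles poorly. In particular, it adapts to different structure in different parts of the key. Like quicksort, it needs no extra space, which is another major advantage over MSD: MSD needs extra space both for the frequency counts and for the auxiliary array. One drawback relative to MSD is that three-way string quicksort only splits the array into three parts. When the corresponding MSD pass would produce many nonempty subarrays, several three-way partitions are needed to achieve the same effect, so more data gets moved.
Small subarrays can likewise be special-cased with insertion sort, although this matters far less than it does for MSD.
As with quicksort, it is best to shuffle the array before sorting to guard against the worst case.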


#include <vector>
#include <string>
#include <iostream>

using namespace std;

void Quick3string(vector<string> &a, int lo, int hi, int d);
void printVec(const vector<string> &a);
int charAt(const string &s, int d);

int main() {
    vector<string> a;
    string s;
    while (cin >> s)
        a.push_back(s);
    cout << "before sorting:" << endl;
    printVec(a);

    Quick3string(a, 0, a.size()-1, 0);

    cout << "after sorting:" << endl;
    printVec(a);
}

void printVec(const vector<string> &a) {
    for (const auto &item : a)
        cout << item << " ";
    cout << endl;
}

int charAt(const string &s, int d) {
    // Return the d-th character as a non-negative value, or -1 past the end of s.
    if (d < (int)s.size())
        return (unsigned char)s[d];
    else
        return -1;
}

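void Quick3string(vector<string> &a, int lo, int hi, int d) {
    using std::swap;
    if (hi <= lo) return;
    int lt = lo, gt = hi;
    int v = charAt(a[lo], d);  // partitioning character
    int i = lo + 1;
    // Invariant: a[lo..lt-1] < v, a[lt..i-1] == v, a[gt+1..hi] > v (at position d).
    while (i <= gt) {
        int t = charAt(a[i], d);
        if (t < v) swap(a[lt++], a[i++]);
        else if (t > v) swap(a[i], a[gt--]);
        else i++;
    }
    Quick3string(a, lo, lt-1, d);              // smaller leading character: same position
    if (v >= 0) Quick3string(a, lt, gt, d+1);  // equal: next character, unless the keys have ended
    Quick3string(a, gt+1, hi, d);              // larger leading character: same position
}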
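
The listing above does not shuffle the input, although the text recommends doing so. One way to add the shuffle is sketched below. The names `shuffledQuick3`, `quick3`, and `charAtOrEnd` are made up for this sketch, and using `std::shuffle` with a `std::mt19937` generator is an assumption about how one might do it, not something the book prescribes.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <string>
#include <utility>
#include <vector>

// d-th character as a non-negative value, or -1 past the end of s.
int charAtOrEnd(const std::string& s, int d) {
    return d < static_cast<int>(s.size()) ? static_cast<unsigned char>(s[d]) : -1;
}

// Three-way string quicksort on a[lo..hi], starting at character d.
void quick3(std::vector<std::string>& a, int lo, int hi, int d) {
    if (hi <= lo) return;
    int lt = lo, gt = hi, i = lo + 1;
    const int v = charAtOrEnd(a[lo], d);
    while (i <= gt) {
        const int t = charAtOrEnd(a[i], d);
        if (t < v) std::swap(a[lt++], a[i++]);
        else if (t > v) std::swap(a[i], a[gt--]);
        else ++i;
    }
    quick3(a, lo, lt - 1, d);
    if (v >= 0) quick3(a, lt, gt, d + 1);
    quick3(a, gt + 1, hi, d);
}

// Hypothetical wrapper: shuffle first, then sort. The seed parameter exists
// only so that a run can be reproduced.
void shuffledQuick3(std::vector<std::string>& a,
                    unsigned seed = std::random_device{}()) {
    std::mt19937 gen(seed);
    std::shuffle(a.begin(), a.end(), gen);
    quick3(a, 0, static_cast<int>(a.size()) - 1, 0);
    assert(std::is_sorted(a.begin(), a.end()));  // postcondition check
}
```

The shuffle costs time linear in the number of keys and makes a bad sequence of partitioning characters unlikely whatever order the input arrives in. The sorted result does not depend on the seed.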

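This completes the implementations of the string sorting algorithms from section 5.1 of algs4. Page 471 of the book has a comparison of the string sorting algorithms that is worth reading.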
