C++ regex grouping

Preface

The input source code is treated as a string, and I need to identify the identifiers, numbers, complex numbers, functions, keywords, and so on in it. When I wrote the improved version of my calculator, I used two matching passes: the first split the input into tokens, and the second classified each token. I wanted the classification step to be faster: if the first pass can also give me the group information, the second match can be omitted!


$n, n > 0: in a replacement format string, $n refers to the content of the n-th group.

Groups can also be extracted with regex_replace, but it combines the results for the same group across all matches and returns only a single string. I thought about splitting that string, but applying a regex again would defeat the purpose. So I tried splitting on spaces, but the pieces of this string are not reliably separated by spaces: some are and some are not.

string result = regex_replace(str, reg, "$1"); // replaces each match of reg with the content of its first group
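
To make the problem concrete, here is a minimal sketch of what regex_replace with "$1" returns for the tokenizer input used later in this post. The helper name group1_of_tokens is made up for illustration, and the pattern is a simplified version of the one below (unescaped character class, non-capturing groups inside the number).

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replace every token with the content of group 1
// (the identifier group). Unmatched groups expand to an empty string.
std::string group1_of_tokens(const std::string& input, bool keep_unmatched) {
    static const std::regex reg(
        R"re(([a-zA-Z_]\w*)|(,)|([-+*/^%()])|(\d+(?:\.\d+)?(?:e[-+]?\d+)?))re");
    // By default, text between matches (here: the spaces) is copied through.
    // format_no_copy drops it and keeps only the replacements.
    const auto flags = keep_unmatched ? std::regex_constants::format_default
                                      : std::regex_constants::format_no_copy;
    return std::regex_replace(input, reg, "$1", flags);
}
```

For "43+12 - Print(3,4)*7" the default call gives back the two spaces followed by Print, all in one string, which is why splitting the result afterwards is unreliable.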

Match result

Each match result is stored in an smatch object, and dereferencing the iterator yields an smatch. The size() of an smatch is the number of groups plus one, because index 0 holds the whole match; the number of groups is the number of capturing left parentheses in the regex. .str(n) returns the content of the n-th group. When n = 0 it returns the whole match, and the same happens if no argument is passed.
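
A small sketch of these accessors, assuming a helper named groups_of_first_match (not part of the original code) that collects str(0) through str(size() - 1) for the first match:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: the whole match followed by every capture group
// of the first match, or an empty vector if nothing matches.
std::vector<std::string> groups_of_first_match(const std::string& input,
                                               const std::string& pattern) {
    std::vector<std::string> out;
    const std::regex reg(pattern);
    std::smatch m;
    if (std::regex_search(input, m, reg)) {
        // m.size() is the number of capture groups plus one (index 0).
        for (std::size_t i = 0; i < m.size(); ++i) {
            out.push_back(m.str(i));  // empty string for a group that did not take part
        }
    }
    return out;
}
```

With the pattern (\d+)(\.\d+)? the result always has three entries: for "x = 3.14" they are 3.14, 3 and .14, and for "x = 42" the last entry is empty.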


	// reg1 has 6 capture groups, so size() is 7 and the valid indices are 0..6
	regex reg1("([a-zA-Z\\_]\\w*)|(,)|([\\+\\-\\*\\/\\^\\%\\(\\)])|(\\d+(\\.\\d+)?(e[\\+\\-]?\\d+)?)");
	string str1 = "43+12 - Print(3,4)*7";
	for (sregex_iterator it(str1.begin(), str1.end(), reg1), it_end; it != it_end; ++it) {
		// index 7 is past the last group, so str(7) simply returns an empty string
		for (int i = 0; i < 8; ++i) {
			cout << i << ">" << (*it).str(i) << "\t";
		}
		cout << endl;
	}

	

Output

0>43    1>      2>      3>      4>43    5>      6>      7>
0>+     1>      2>      3>+     4>      5>      6>      7>
0>12    1>      2>      3>      4>12    5>      6>      7>
0>-     1>      2>      3>-     4>      5>      6>      7>
0>Print 1>Print 2>      3>      4>      5>      6>      7>
0>(     1>      2>      3>(     4>      5>      6>      7>
0>3     1>      2>      3>      4>3     5>      6>      7>
0>,     1>      2>,     3>      4>      5>      6>      7>
0>4     1>      2>      3>      4>4     5>      6>      7>
0>)     1>      2>      3>)     4>      5>      6>      7>
0>*     1>      2>      3>*     4>      5>      6>      7>
0>7     1>      2>      3>      4>7     5>      6>      7>
Press any key to continue. . .

The output above shows that each kind of substring fills a different group, so the index of its non-empty group tells us its type!

We only need to record this index, and then classify each substring according to the regex group it matched.
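
As a rough sketch of that idea (not the code from this post), the index of the first group that took part in the match can be mapped straight to a token type. The names TokenKind and tokenize are made up for illustration, and the pattern is an assumed variant of reg1 whose inner number groups are non-capturing, so only the four top-level groups remain.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token types; the values equal the capture group indices.
enum class TokenKind { Identifier = 1, Comma = 2, Operator = 3, Number = 4 };

// Hypothetical tokenizer: one pass that both splits and classifies.
std::vector<std::pair<TokenKind, std::string>> tokenize(const std::string& src) {
    static const std::regex reg(
        R"re(([a-zA-Z_]\w*)|(,)|([-+*/^%()])|(\d+(?:\.\d+)?(?:e[-+]?\d+)?))re");
    std::vector<std::pair<TokenKind, std::string>> tokens;
    for (std::sregex_iterator it(src.begin(), src.end(), reg), end; it != end; ++it) {
        // Skip index 0 (the whole match) and find the group that matched.
        for (std::size_t i = 1; i < it->size(); ++i) {
            if ((*it)[i].matched) {
                tokens.emplace_back(static_cast<TokenKind>(i), it->str());
                break;
            }
        }
    }
    return tokens;
}
```

Checking sub_match::matched instead of comparing against an empty string also stays correct if a group is ever allowed to match zero characters.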

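	for (sregex_iterator it(str1.begin(), str1.end(), reg1), it_end; it != it_end; ++it) {
		//cout << it->str() << ",len=" << it->length() <<","<< endl;
		bool flag = false;  // set once the first non-empty group has been seen
		int j = 0, k = 0;   // j: empty groups before the first non-empty one, k: non-empty groups
		//cout << "size="<<(*it).size() << endl;
		for (int i = 0; i < 7; ++i) {
			if (i != 0) {  // index 0 is the whole match, not a group
				if ((*it).str(i) != "") {
					flag = true;  // once a non-empty group is found, stop counting empty ones
					++k;
				}
				else {
					if (!flag)	++j;
				}
			}
			cout << i << ">" << (*it).str(i) << "\t";
		}
		// j + k is 1, 2 or 3 for identifiers, commas and operators; a number gives 4,
		// or 5 or 6 when the nested groups 5 and 6 (fraction, exponent) also match
		cout << "\tindex=" << j + k << endl;
	}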
  • Originally I wanted to write this post in the early morning to explain how to classify tokens with a regex in C++ without doing a second match! Because getting the grouping information by matching once more is too slow, the method I had in mind at the time was to wrap the different groups in progressively more layers of parentheses (lol). After writing half of the article in the morning, I tested again in the evening and found that the grouping information is available without any extra parentheses, and that the more parentheses a regex contains, the worse it performs. So write as few parentheses as possible!
  • Also for performance, the + sign inside square brackets should not be written as \\+, and the same goes for the minus sign, dot, question mark, parentheses, curly brackets, asterisk, and so on. Inside square brackets, use the plain symbols without the two escape characters. The only care needed is to put the minus sign first or last so that it is not read as a range.
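
A quick way to check that dropping the escapes does not change what the character class accepts is to compare both forms character by character. This is only a sketch; the function name same_operator_class is made up, and it says nothing about the speed difference.

```cpp
#include <regex>
#include <string>

// Hypothetical check: do the escaped and the plain character class
// accept exactly the same characters out of `probe`?
bool same_operator_class(const std::string& probe) {
    static const std::regex escaped(R"re([\+\-\*\/\^\%\(\)])re");
    static const std::regex plain(R"re([-+*/^%()])re");  // '-' first, '^' not first
    for (char c : probe) {
        const std::string s(1, c);
        if (std::regex_match(s, escaped) != std::regex_match(s, plain)) {
            return false;
        }
    }
    return true;
}
```

Calling it with a probe string such as "+-*/^%()a1 ,._" should return true, since both classes accept the eight operator characters and reject everything else.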

Origin blog.csdn.net/weixin_41374099/article/details/104106090