Protobuf (1) - Introduction to Protobuf

Introduction to Protobuf

What is Google Protocol Buffer? If you search online, you will find a description along these lines:

Google Protocol Buffer (Protobuf for short) is Google's internal, language-neutral data standard. More than 48,162 message formats are already defined in more than 12,183 .proto files. They are used in RPC systems and persistent data storage systems.

Protocol Buffers is a lightweight, efficient format for storing structured data, and it can be used to serialize that data. It is well suited as a data storage format or as an RPC data exchange format. It is a language-neutral, platform-neutral, extensible serialization format that can be used in communication protocols, data storage and other fields. APIs are currently provided for three languages: C++, Java and Python.

Perhaps, like me, you still do not understand what Protobuf actually is after first reading these descriptions. A relatively simple example should help.

1, A simple example

1, Install Google Protocol Buffer

The Protobuf source code can be downloaded from http://code.google.com/p/protobuf/downloads/list. Extract it, then compile and install it, and it is ready to use.

The installation steps are as follows:
 tar -xzf protobuf-2.1.0.tar.gz 
 cd protobuf-2.1.0 
 ./configure --prefix=$INSTALL_DIR 
 make 
 make check 
 make install

2, A brief description of the example

I intend to develop a very simple example program with Protobuf and C++.

The program consists of two parts. The first part is called Writer, the second Reader.

Writer is responsible for writing some structured data to a disk file. Reader is responsible for reading the structured data from that file and printing it to the screen.

The structured data prepared for the demonstration is helloworld, which contains two basic fields:

ID, an integer
Str, a string

3, Write the .proto file

First we need to write a proto file that defines the structured data our program needs to process. In Protobuf terms, structured data is called a Message. The syntax of a proto file closely resembles a data definition in Java or C. Listing 1 shows the contents of the proto file for the example application.

// Listing 1. The proto file
 package lm; 
 message helloworld 
 { 
    required int32     id = 1;  // ID 
    required string    str = 2;  // str 
    optional int32     opt = 3;  //optional field
 }

A good habit is to take the name of the proto file seriously. For example, the following naming convention can be used:

packageName.MessageName.proto

In the example above, the package is named lm and it defines a message helloworld with three members: id of type int32, str of type string, and opt, an optional member that the message may omit.

4, Compile the .proto file

Once the proto file is written, the Protobuf compiler can compile it into the target language.

In this example we will use C++.

Assuming your proto file is stored in $SRC_DIR and the generated files should go to $DST_DIR (which may be the same directory), you can use the following command:

 protoc -I=$SRC_DIR --cpp_out=$DST_DIR $SRC_DIR/lm.helloworld.proto

The command generates two files:

lm.helloworld.pb.h, the C++ header file that defines the class

lm.helloworld.pb.cc, the implementation file of the C++ class

The generated header file defines a C++ class helloworld. Writer and Reader will later use this class to operate on the message, for example to assign the members of the message or to serialize it.

5, Write Writer and Reader

As described above, Writer writes structured data to disk for others to read. Without Protobuf there are still many options. One possible approach is to convert the data to a string and write the string to disk. The conversion can use sprintf(), which is very simple: the number 123 becomes the string "123".

Nothing seems wrong with this, but on closer thought it demands a lot of whoever writes Reader: the author of Reader must know the details of Writer. For example, "123" may be the single number 123, but it may also be the three numbers 1, 2 and 3, and so on. So Writer must also define a delimiter so that Reader can read the data correctly. But a delimiter may in turn cause other problems. In the end, even a simple helloworld needs a lot of code just to handle the message format.
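To make the ambiguity concrete, here is a minimal C++ sketch. The helper join_as_text is hypothetical (it is not part of the example program); it stands in for a naive Writer that prints two integers back to back without a delimiter.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: writes two integers back to back as text,
// the way a naive Writer without a delimiter might.
std::string join_as_text(int first, int second) {
    char buf[64];
    int n = std::snprintf(buf, sizeof buf, "%d%d", first, second);
    assert(n >= 0 && n < static_cast<int>(sizeof buf));  // the output fit in buf
    return std::string(buf, static_cast<std::size_t>(n));
}
```

Both join_as_text(1, 23) and join_as_text(12, 3) yield the text "123", so a Reader that sees only the string cannot tell which pair of numbers was written.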

With Protobuf, the application does not need to consider these details.

With Protobuf, the work of Writer is simple. The structured data to be processed is described by the .proto file; after the compilation step in the previous section, it corresponds to a C++ class defined in lm.helloworld.pb.h. In this example the class is named lm::helloworld.

Writer needs to include that header file and can then use the class.

Now, in the Writer code, the structured data to be stored on disk is represented by an object of class lm::helloworld, which provides a series of get/set functions to read and modify the data members of the structured data, also called fields.

When we need to save the structured data to disk, class lm::helloworld provides a method that turns the complex data into a byte sequence, which we can then write to disk.

A program that wants to read the data only needs to call the corresponding deserialization method of class lm::helloworld to convert the byte sequence back into structured data. This is the same idea as the "123" approach we started with, but Protobuf is far more thorough than our crude string conversion, so we can safely leave this kind of work to Protobuf.

Listing 2 shows the main code of Writer. You will surely find it very simple.

// Listing 2. Main code of Writer
 #include <fstream>
 #include <iostream>
 #include "lm.helloworld.pb.h"
 using namespace std;

 int main(void) 
 { 
  lm::helloworld msg1; 
  msg1.set_id(101); 
  msg1.set_str("hello"); 
  // Write the message to disk. 
  fstream output("./log", ios::out | ios::trunc | ios::binary); 
  if (!msg1.SerializeToOstream(&output)) { 
     cerr << "Failed to write msg." << endl; 
     return -1; 
  }         
  return 0; 
 }

msg1 is an object of class helloworld, and set_id() sets the value of id. SerializeToOstream serializes the object and writes it to an fstream stream.

Listing 3 shows the main code of Reader.

// Listing 3. Reader
 #include <fstream>
 #include <iostream>
 #include "lm.helloworld.pb.h" 
 using namespace std;

 // Print the fields of the message.
 void ListMsg(const lm::helloworld & msg) { 
  cout << msg.id() << endl; 
  cout << msg.str() << endl; 
 } 
 int main(int argc, char* argv[]) { 
  lm::helloworld msg1; 
  { 
    fstream input("./log", ios::in | ios::binary); 
    if (!msg1.ParseFromIstream(&input)) { 
      cerr << "Failed to parse msg." << endl; 
      return -1; 
    } 
  }  
  ListMsg(msg1); 
  … 
 }

Similarly, Reader declares an object msg1 of class helloworld and uses ParseFromIstream to read the data from an fstream stream and deserialize it. ListMsg then uses the get methods to read the fields of the message and prints them.

Results

// Running Writer and Reader gives the following result:
 >writer 
 >reader 
 101 
 hello

Reader reads the serialized message from the file log and prints it to the screen. All the code examples in this article can be downloaded from the attachment, so you can try them yourself.

This example by itself is of little practical use, but with slight modifications it can become a more useful program. For example, replace the disk with a network socket and you have a data exchange task over the network. Storage and exchange are exactly the areas where Protobuf is most effective.

2, Comparison with other similar technologies

After this simple example I hope you understand what Protobuf does. You might then say that there are many similar technologies in the world, such as XML, JSON and Thrift. How does Protobuf differ from them?

In short, the main advantages of Protobuf are that it is simple and fast.

There is a test to back this up: the project thrift-protobuf-compare compares these similar technologies. Figure 1 shows one of its results, Total Time.

Figure 1. Performance test results

Total Time is the time of an entire operation on an object: creating the object, serializing it into a byte sequence in memory, and then deserializing it again. The results show that Protobuf scores well. Interested readers can find more detailed results at http://code.google.com/p/thrift-protobuf-compare/wiki/Benchmarking.

1, Advantages of Protobuf

Protobuf is like XML, but smaller, faster and simpler. You define your own data structure, and the generated code reads and writes it. You can even update the data structure without redeploying your programs. A single Protobuf description of the data structure is enough to read and write your structured data from a variety of data streams in a variety of languages.

It also has a great feature, good "backward" compatibility: the data structure can be upgraded without breaking deployed programs that rely on the "old" data format. Your program therefore need not worry about large-scale code refactoring or migration caused by a change in the message structure, because adding a new field to a message does not require any change to programs that have already been released.

Protobuf has clearer semantics and needs nothing like an XML parser, because the Protobuf compiler compiles the .proto file into data access classes that serialize and deserialize Protobuf data.

Using Protobuf does not require learning a complicated document object model. Its programming model is friendly and easy to learn, and it has good documentation and examples. For people who like simple things, Protobuf is more attractive than other technologies.

2, Shortcomings of Protobuf

Compared with XML, Protobuf also has shortcomings. Its feature set is simple, so it cannot represent complex concepts.

XML has become an industry standard supported by many authoring tools, whereas Protobuf is only a tool used inside Google and is far less universal.

Because text is not well suited to describing data structures, Protobuf is not suitable for modeling text-based markup documents such as HTML. In addition, XML is self-describing to some degree, so people can read and edit it directly. Protobuf cannot do this: it is stored in binary, and unless you have the .proto definition you cannot read any Protobuf content directly.

3, Advanced topics

1, More complex Messages

So far we have only given a simple example of little practical use. In practice, people often need to define more complex Messages. By "complex" we mean not just more fields or more field types, but more complex data structures:

Nested Message

Nesting is a wonderful concept: once messages can be nested, their expressive power becomes very strong.

Listing 4 shows an example of a nested Message.

// Listing 4. Example of a nested Message
 message Person { 
  required string name = 1; 
  required int32 id = 2;        // Unique ID number for this person. 
  optional string email = 3; 
  enum PhoneType { 
    MOBILE = 0; 
    HOME = 1; 
    WORK = 2; 
  } 
  message PhoneNumber { 
    required string number = 1; 
    optional PhoneType type = 2 [default = HOME]; 
  } 
  repeated PhoneNumber phone = 4; 
 }

In the Message Person, a nested message PhoneNumber is defined and used to define the phone field of Person. This makes it possible to define more complex data structures.

2, Import Message

A .proto file can also use the import keyword to bring in messages defined in other .proto files. This can be called an Import Message, or a Dependency Message.

For example:

// Listing 5. Code
 import "common.header.proto";  // assumed file name; import takes a quoted path to a .proto file
 message youMsg{ 
  required common.info_header header = 1; 
  required string youPrivateData = 2; 
 }

Here common.info_header is defined in the imported common.header file.

Import Message is useful because it provides a convenient way to manage code, similar to header files in C. You can define common Messages in one package, import that package in other .proto files, and use the message definitions from it.

Google Protocol Buffer supports both nested Messages and imported Messages well, which makes defining complex data structures easy and pleasant.

3, Dynamic compilation

Normally, people who use Protobuf first write the .proto file and then use the Protobuf compiler to generate source files in the target language. The generated code is then compiled together with the application.

In some cases, however, the .proto file cannot be known in advance and unknown .proto files must be handled dynamically. General-purpose message-forwarding middleware, for example, cannot predict what messages it will have to handle. This requires compiling the .proto file dynamically and using the Messages in it.

Protobuf provides the google::protobuf::compiler package for dynamic compilation. The main class is Importer, defined in importer.h. Importer is very simple to use. The figure below shows the relationship between Importer and several other important classes.

Figure 2. The Importer class

An Importer object contains three main objects, among them a MultiFileErrorCollector, which handles errors, and a SourceTree, which defines the source directory of the .proto files.

The following example illustrates the relationships between these classes and how to use them.

Compiling a given proto file, such as lm.helloworld.proto, dynamically within a program takes very little code, as shown in Listing 6.

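// Listing 6. Code
 // MultiFileErrorCollector is abstract, so a subclass is needed;
 // MyErrorCollector is a hypothetical subclass that implements AddError().
 MyErrorCollector errorCollector;
 google::protobuf::compiler::DiskSourceTree sourceTree; 
 google::protobuf::compiler::Importer importer(&sourceTree, &errorCollector); 
 sourceTree.MapPath("", protosrc);   // protosrc: directory containing the .proto files
 importer.Import("lm.helloworld.proto");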

First, construct an Importer object. Its constructor takes two parameters. The first is a SourceTree object that specifies the source directory where the .proto files are stored. The second is an error collector object, which has an AddError method for handling syntax errors encountered while parsing the .proto file.

After that, whenever a .proto file needs to be compiled dynamically, simply call the Import method of the Importer object. It is that simple.

So how do we use a Message after dynamic compilation? We first need to know a few other classes.

Protobuf provides several classes that represent the messages defined in a .proto file and the fields of a Message, as shown in the figure below.

Figure 3. Relationships among the Compiler classes

Class FileDescriptor represents a compiled .proto file; class Descriptor corresponds to a Message in that file; class FieldDescriptor describes one specific Field of a Message.

For example, after lm.helloworld.proto is compiled, the definition of lm.helloworld.id can be obtained with the following code:

// Listing 7. Code to get the definition of lm.helloworld.id
const google::protobuf::Descriptor* desc = 
    importer.pool()->FindMessageTypeByName("lm.helloworld"); 
const google::protobuf::FieldDescriptor* field = 
    desc->FindFieldByName("id");

Through the methods and properties of Descriptor and FieldDescriptor, an application can obtain all kinds of information about the Message definition. For example, field->name() returns the name of the field. In this way you can work with a dynamically defined message.

4, Write a new proto compiler

The protoc compiler released with the Google Protocol Buffer source code supports three programming languages: C++, Java and Python. With the Google Protocol Buffer compiler package, however, you can develop new compilers that support other languages.

Class CommandLineInterface encapsulates the front end of the protoc compiler, including command-line parsing and compilation of the proto file. What you need to do is implement a class derived from CodeGenerator that does the back-end work such as generating code.

The general framework of such a program is shown in the figure:

Figure 4. Block diagram of the XML compiler

In main(), create a CommandLineInterface object cli and call its RegisterGenerator() method to register yourG, the back-end code generator object for the new language, with cli. Then simply call the Run() method of cli.

A compiler built this way is used the same way as protoc and accepts the same command-line parameters. cli performs lexical and syntax analysis on the .proto file supplied by the user and finally produces a syntax tree. The structure of the tree is shown in the figure.

Figure 5. Syntax tree

The root node is a FileDescriptor object (see the "Dynamic compilation" section), which is passed as an input parameter to the Generate() method of yourG. In this method you can traverse the syntax tree and generate the code you need. In short, to implement a new compiler you only need to write a main function and a derived class that implements the Generate() method.

The download attachment of this article contains a reference example: a compiler that compiles a .proto file into an XML file.

4, More details about Protobuf

It has been stressed that, compared with XML, the main advantage of Protobuf is performance. It is stored in an efficient binary format that is 3 to 10 times smaller than XML and 20 to 100 times faster.

A serious programmer needs an explanation for claims like "3 to 10 times smaller" and "20 to 100 times faster". So at the end of this article, let us dig a little deeper into the internals of Protobuf.

Two techniques ensure that programs using Protobuf gain a significant performance improvement over XML.

First, we can examine the serialized content of a Protobuf message. The representation is very compact, which means the message is smaller and naturally needs fewer resources: fewer bytes to transmit over the network, less IO, and hence better performance.

Second, we need to understand the general process of packing and unpacking in Protobuf to see why it is so much faster than XML.

1, Google Protocol Buffer Encoding

The binary message produced by Protobuf serialization is very compact, thanks to the clever encoding method Protobuf uses.

Before looking at the structure of a message, let me first introduce a term: Varint.

Varint is a compact way of representing numbers. It uses one or more bytes to represent a number, and the smaller the value, the fewer bytes it uses. This reduces the number of bytes needed to represent numbers.

For example, a number of type int32 normally needs 4 bytes. With Varint, a very small int32 value can be represented in 1 byte. Of course everything has a downside: in Varint notation a large number needs 5 bytes. Statistically, though, the numbers in messages are generally not all large, so in most cases Varint represents numeric information in fewer bytes. Varint is described in detail below.

The highest bit of each byte in a Varint has a special meaning: if it is 1, the following byte is also part of the number; if it is 0, the number ends with this byte. The other 7 bits represent the number. A number less than 128 can therefore be represented in one byte. A number of 128 or more, such as 300, takes two bytes: 1010 1100 0000 0010

The figure below shows how Google Protocol Buffer parses these two bytes. Note that the positions of the two bytes are swapped before the final calculation, because Google Protocol Buffer uses little-endian byte order.

Figure 6. Varint encoding
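The Varint rules described above can be sketched in a few lines of C++. This is an illustration under my own naming (encode_varint and decode_varint are not Protobuf API functions), not the library's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode an unsigned value as a varint: 7 payload bits per byte,
// least significant group first, high bit set on every byte but the last.
std::vector<std::uint8_t> encode_varint(std::uint64_t value) {
    std::vector<std::uint8_t> out;
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
    return out;
}

// Decode the varint that starts at bytes[0] and return its value.
// Assumes the input holds a complete, well-formed varint.
std::uint64_t decode_varint(const std::vector<std::uint8_t>& bytes) {
    std::uint64_t value = 0;
    int shift = 0;
    for (std::uint8_t b : bytes) {
        assert(shift < 64);  // a 64-bit varint never exceeds 10 bytes
        value |= static_cast<std::uint64_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) break;  // high bit clear: this was the last byte
        shift += 7;
    }
    return value;
}
```

For 300 the encoder produces the two bytes 0xAC 0x02, which is the bit pattern 1010 1100 0000 0010. A value below 128 takes one byte, and a value that needs all 32 bits takes five.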

After serialization, a message becomes a binary data stream made up of a series of Key-Value pairs, as shown below:

Figure 7. Message Buffer

With this Key-Value structure no delimiter is needed to separate fields. For an optional field that is not set in the message, the final Message Buffer contains nothing for that field. These features help reduce the size of the message itself.

Take the message in Listing 1 as an example. Suppose we create the following message Test1:

Test1.id = 101; 
Test1.str = "hello";

The final Message Buffer then contains two Key-Value pairs, one for the id of the message and the other for str.

The Key identifies a specific field. When unpacking, Protocol Buffer uses the Key to determine which field of the message the corresponding Value belongs to.

The Key is defined as follows:

 (field_number << 3) | wire_type

As we can see, the Key consists of two parts. The first part is field_number; for example, the field_number of field id in message lm.helloworld is 1. The second part is wire_type, which indicates how the Value is transmitted.

The possible wire types are shown in the following table:

Table 1. Wire Type

Type  Meaning           Used For
0     Varint            int32, int64, uint32, uint64, sint32, sint64, bool, enum
1     64-bit            fixed64, sfixed64, double
2     Length-delimited  string, bytes, embedded messages, packed repeated fields
3     Start group       Groups (deprecated)
4     End group         Groups (deprecated)
5     32-bit            fixed32, sfixed32, float
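As a small sketch of the Key formula and the wire types in the table, the following C++ builds a Key and takes it apart again. The names WireType, make_key, key_field_number and key_wire_type are mine, chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Wire types from Table 1.
enum WireType : std::uint32_t {
    kVarint = 0,
    kFixed64 = 1,
    kLengthDelimited = 2,
    kStartGroup = 3,
    kEndGroup = 4,
    kFixed32 = 5
};

// Key = (field_number << 3) | wire_type
std::uint32_t make_key(std::uint32_t field_number, WireType wire_type) {
    assert(field_number >= 1);  // field numbers start at 1
    return (field_number << 3) | wire_type;
}

// The upper bits of a Key hold the field number.
std::uint32_t key_field_number(std::uint32_t key) { return key >> 3; }

// The lowest three bits of a Key hold the wire type.
WireType key_wire_type(std::uint32_t key) {
    return static_cast<WireType>(key & 0x7);
}
```

For lm.helloworld this gives 0x08 for id (field 1, Varint) and 0x12 for str (field 2, Length-delimited).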

In our example, field id has data type int32, so the corresponding wire type is 0. Careful readers may notice that Type 0 covers two very similar data types, int32 and sint32. Google Protocol Buffer distinguishes them mainly to reduce the number of bytes after encoding.

In a computer, a negative number is generally represented as a very large integer, because the highest bit is defined as the sign bit. If a negative number is represented with a plain Varint it therefore always takes the maximum number of bytes (ten for a negative int32, since the value is sign-extended to 64 bits). For this reason Google Protocol Buffer defines the sint32 type, which uses zigzag encoding.

Zigzag encoding uses unsigned numbers to represent signed numbers, with positive and negative numbers interleaved. That is what the word zigzag means.

As shown in the figure:

Figure 8. ZigZag encoding

With zigzag encoding, a number with a small absolute value, whether positive or negative, can be represented in fewer bytes, which takes full advantage of the Varint technique.
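The zigzag mapping for 32-bit values can be sketched as follows. The function names are my own, and the code is an illustration of the mapping, not the library's source.

```cpp
#include <cassert>
#include <cstdint>

// Map a signed value to an unsigned one so that values with a small
// absolute value become small numbers: 0, -1, 1, -2, 2 ... -> 0, 1, 2, 3, 4 ...
std::uint32_t zigzag_encode32(std::int32_t n) {
    std::uint32_t sign_mask = (n < 0) ? 0xFFFFFFFFu : 0u;
    return (static_cast<std::uint32_t>(n) << 1) ^ sign_mask;
}

// Inverse mapping: the lowest bit carries the sign.
std::int32_t zigzag_decode32(std::uint32_t v) {
    return static_cast<std::int32_t>((v >> 1) ^ (0u - (v & 1u)));
}

// Debug helper: aborts if a value does not survive encode plus decode.
void check_round_trip(std::int32_t n) {
    assert(zigzag_decode32(zigzag_encode32(n)) == n);
}
```

With this mapping -1 becomes 1 and -2 becomes 3, so each still fits in a single Varint byte.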

Other data types, such as strings, use a representation similar to varchar in databases: a varint gives the length, and the content follows immediately after it.
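Putting the Key, Varint and length-prefixed string rules together, a hand-rolled encoder for an lm.helloworld message with only id and str set might look like this. It is a sketch with helper names of my own (Bytes, put_varint, put_key, encode_helloworld); the real work is done by the code that protoc generates.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append value to out as a varint.
void put_varint(Bytes& out, std::uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Append the Key (field_number << 3) | wire_type.
void put_key(Bytes& out, std::uint32_t field_number, std::uint32_t wire_type) {
    assert(field_number >= 1 && wire_type <= 5);
    put_varint(out, (field_number << 3) | wire_type);
}

// Encode id (field 1, Varint) and str (field 2, Length-delimited).
// The unset optional field opt contributes no bytes at all.
Bytes encode_helloworld(std::int32_t id, const std::string& str) {
    Bytes out;
    put_key(out, 1, 0);
    // An int32 is sign-extended to 64 bits before varint encoding.
    put_varint(out, static_cast<std::uint64_t>(static_cast<std::int64_t>(id)));
    put_key(out, 2, 2);
    put_varint(out, str.size());                    // length prefix
    out.insert(out.end(), str.begin(), str.end());  // raw string bytes
    return out;
}
```

For id = 101 and str = "hello" this produces nine bytes: 08 65 12 05 68 65 6C 6C 6F.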

From the description of the Protobuf encoding above, you have surely seen that Protobuf messages are small and well suited to network transmission. If you lack the patience or interest for the technical details, the following simple and intuitive comparison should leave a deeper impression.

For the message in Listing 1, the byte sequence after Protobuf serialization is:

 08 65 12 05 68 65 6C 6C 6F

With XML it would look like this:

3C 68 65 6C 6C 6F 77 6F 72 6C 64 3E 3C 69 64 3E 
31 30 31 3C 2F 69 64 3E 3C 6E 61 6D 65 3E 68 65 
6C 6C 6F 3C 2F 6E 61 6D 65 3E 3C 2F 68 65 6C 6C
6F 77 6F 72 6C 64 3E  

That is 55 bytes in total. These strange numbers need a little explanation; in ASCII they mean the following:

<helloworld> 
    <id>101</id> 
    <name>hello</name> 
</helloworld>

2, Packing and unpacking speed

First, let us look at how XML is unpacked. XML has to be read from the file as a string and converted into an XML document object model. The string of a given node is then read from that model and finally converted into a variable of the required type. This process is very complex; converting the XML file into a document object model usually requires lexical and syntax analysis, which consumes a lot of CPU.

Protobuf, in contrast, simply reads a binary sequence into the corresponding C++ structure according to the specified format. As the description of the encoding above shows, decoding a message can be done with a handful of expressions built from shift operations, so it is very fast.
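To show how little work decoding involves, here is a hand-written parser for the helloworld message. It is only a sketch: HelloWorld, read_varint and parse_helloworld are hypothetical names, and it handles just the Varint and Length-delimited wire types. Fields it does not know, such as opt, are skipped, which is also why adding a field does not break an older reader.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct HelloWorld {
    std::int32_t id = 0;
    std::string str;
};

// Read one varint at buf[pos] and advance pos. Returns false on truncated input.
bool read_varint(const std::vector<std::uint8_t>& buf, std::size_t& pos,
                 std::uint64_t& value) {
    value = 0;
    for (int shift = 0; shift < 64 && pos < buf.size(); shift += 7) {
        std::uint8_t b = buf[pos++];
        value |= static_cast<std::uint64_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) return true;
    }
    return false;
}

// Walk the Key-Value pairs and fill in the fields we know.
bool parse_helloworld(const std::vector<std::uint8_t>& buf, HelloWorld& msg) {
    std::size_t pos = 0;
    while (pos < buf.size()) {
        std::uint64_t key = 0, v = 0;
        if (!read_varint(buf, pos, key)) return false;
        std::uint64_t field = key >> 3;                     // field_number
        unsigned wire = static_cast<unsigned>(key & 0x7);   // wire_type
        if (wire == 0) {            // Varint
            if (!read_varint(buf, pos, v)) return false;
            if (field == 1) msg.id = static_cast<std::int32_t>(v);
        } else if (wire == 2) {     // Length-delimited
            if (!read_varint(buf, pos, v) || v > buf.size() - pos) return false;
            auto first = buf.begin() + static_cast<std::ptrdiff_t>(pos);
            if (field == 2) msg.str.assign(first, first + static_cast<std::ptrdiff_t>(v));
            pos += static_cast<std::size_t>(v);
        } else {
            return false;           // other wire types are not handled in this sketch
        }
        assert(pos <= buf.size());
    }
    return true;
}
```

The whole loop consists of shifts, masks and copies; there is no tokenizer and no tree to build.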

To show that I did not just make this up, let us briefly analyze the code path of Protobuf unpacking.

Take Reader in Listing 3 as an example. The program first calls the ParseFromIstream method of msg1, which reads the binary data stream from the file, parses it, and assigns the parsed data to the corresponding data members of class helloworld.

This process can be represented by the following figure:

Figure 9. Flowchart of unpacking

The whole parsing process is carried out jointly by the Protobuf framework code and the code generated by the Protobuf compiler. Protobuf provides the base classes Message and MessageLite as a common framework, and classes such as CodedInputStream and WireFormatLite provide the functions that decode the binary data. As the analysis of the encoding above shows, Protobuf decoding can be done with a few simple mathematical operations and needs no complex lexical or syntax analysis, so methods such as ReadTag() are very fast. The other classes and methods on this call path are very simple, and interested readers can read them on their own. Compared with XML parsing, the flow above is really very simple, isn't it? This is the second reason for the high efficiency of Protobuf.

Origin www.cnblogs.com/littlepage/p/11293833.html