A non-standard C++ implementation of an XML parser, part 1: thoughts on the basic data structure design

Preface:

I have been using XML for simple local data management in C++ projects for five years now. From the first version, which located tags by searching the full text from end to end, to the library currently used in production, which was carefully written to follow the W3C standard, I have written the parser three times in total. I don't particularly like XML. The original requirement came from a customer, but after that first use countless other scenarios followed: configuration files, data exchange, GUI layout. By now, in a new project it is as much a required feature as logging.

Even though I have since implemented its alternative, JSON, I still think XML has plenty of life left in it. A simple example: users find JSON hard going and cannot read it, but they find XML very easy, and the customer comes first, as we all know. So in cross-platform applications its position is still strong. Writing this post means I am about to write an XML parser for the fourth time.

 

The third XML parser I wrote basically conformed to the XML 1.0 standard, apart from the XML declaration. Document type definitions and namespaces were all designed according to the W3C specification. Funnily enough, making these things standards-compliant took a lot of time, yet in the past two years I have not used them in a project even once. There was no practical scenario for them: wherever a document would have needed to carry logic, I handled it explicitly in C++.

 

A few years ago I started out by taking HTML as my reference: things like getElementById and element.innerHTML in JS seemed necessary to implement.

Looking at element.innerHTML, if the details are not handled carefully, the memory footprint of this approach grows without limit as the document gets deeper.

So a design in which every XML node carries the complete text of all its child nodes is not advisable.

 

So at the time I came up with this scheme:

The DOM holds only one complete document.

The data parsed for each node is stored uniformly as a position and a length.

When a node's data is needed, it is copied straight out of the complete document using that position and length.

 

Written this way, things went well at first. It was only when I finally got to the delete, modify and index logic that I found the pit I had expected had dung in it, the dung had maggots, and the maggots were poisonous @#¥%.

In the worst case, modifying one piece of document data through the DOM means traversing every node in the whole document to update its position and length. The project was in a rush at the time and I had no time to start over, so I gritted my teeth and carried on. Fortunately it has been in use for a few years without major problems. But I never found the time to fix it, and it has been my biggest nagging worry in recent years. Yes, I have obsessive-compulsive disorder.
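
To make that cost concrete, here is a minimal sketch of the position-and-length scheme. It is not the code of the old parser: the names text_span, offset_doc, text_of and replace_text are invented for illustration, and nodes are kept in a flat vector rather than a tree.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names: one flat document plus (pos, len) spans into it.
struct text_span {
    std::size_t pos; // offset of the node's text in the document
    std::size_t len; // length of the node's text
};

struct offset_doc {
    std::string text;             // the single complete document
    std::vector<text_span> nodes; // one span per parsed node
};

// Reading is cheap: copy the node's text out of the document.
std::string text_of(const offset_doc& d, std::size_t i) {
    assert(i < d.nodes.size());
    return d.text.substr(d.nodes[i].pos, d.nodes[i].len);
}

// Writing is the problem: every other span has to be checked and possibly fixed.
// Returns how many other spans had to be changed.
std::size_t replace_text(offset_doc& d, std::size_t i, const std::string& s) {
    assert(i < d.nodes.size());
    const text_span old = d.nodes[i];
    d.text.replace(old.pos, old.len, s);
    d.nodes[i].len = s.size();
    std::size_t touched = 0;
    for (std::size_t k = 0; k < d.nodes.size(); ++k) {
        if (k == i) continue;
        text_span& n = d.nodes[k];
        if (n.pos >= old.pos + old.len) {
            n.pos = n.pos - old.len + s.size(); // starts after the edit: move it
            ++touched;
        } else if (n.pos <= old.pos && n.pos + n.len >= old.pos + old.len) {
            n.len = n.len - old.len + s.size(); // encloses the edit: resize it
            ++touched;
        }
    }
    return touched;
}
```

For the document <a><b>1</b><c>2</c></a>, replacing the text of b changes the spans of both a and c, and the loop has to visit every node to find that out.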

 

Well, that is enough lessons learned. Before I start writing code, I am doing my homework on the concrete implementation here; I hope it also helps people who want to do this, or are already doing it, avoid a few detours.

Here is what I have in mind. Taking those lessons into account, this time I intend to split the complete document into a list of strings held in memory. Let me use some simple documents to work through how that would go.

This time I do not intend to implement the parts of the standard where the document itself carries logic; things like DTD and namespaces I intend to drop.

<a>
    <b>1</b>
</a>

The document above splits into the string list {"<a>", "<b>", "1", "</b>", "</a>"}; in C++ it is stored in a std::list<std::string>.
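
A split like this can be sketched in a few lines. split_document is a hypothetical helper, not the parser from this series, and it makes simplifying assumptions: everything between < and > is one tag piece, whitespace-only text between tags is dropped, and comments, CDATA and a > inside an attribute value are not handled.

```cpp
#include <cassert>
#include <list>
#include <string>

// Hypothetical helper: cut a document into tag pieces and text pieces.
std::list<std::string> split_document(const std::string& xml) {
    std::list<std::string> docs;
    std::string::size_type pos = 0;
    while (pos < xml.size()) {
        if (xml[pos] == '<') {
            // A tag piece runs up to and including the next '>'.
            std::string::size_type end = xml.find('>', pos);
            if (end == std::string::npos)
                end = xml.size() - 1; // unterminated tag: take the rest
            docs.push_back(xml.substr(pos, end - pos + 1));
            pos = end + 1;
        } else {
            // A text piece runs up to the next '<'.
            std::string::size_type end = xml.find('<', pos);
            if (end == std::string::npos)
                end = xml.size();
            assert(end > pos); // each pass consumes at least one character
            const std::string text = xml.substr(pos, end - pos);
            // Keep the text only if it is not pure whitespace.
            if (text.find_first_not_of(" \t\r\n") != std::string::npos)
                docs.push_back(text);
            pos = end;
        }
    }
    return docs;
}
```

Calling split_document on the three-line document above yields the five pieces listed, with the indentation and line breaks between tags discarded.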

The node data structure should then look like this:

struct xnode{
    std::list<std::string>::iterator tag_name; // tag name
    std::list<std::string>::iterator inner_begin, inner_end; // inner content, the range [inner_begin, inner_end)
    std::list<xnode> childs; // child nodes
    xnode *parent; // parent node
    std::list<xnode>::iterator self; // this node's position in its parent's child list; copy it and apply ++ or -- to reach the next or previous sibling
};

---------------------------------------------------

xnode root;

After parsing the document:

root.tag_name => "<a>"
root.inner_begin => "<b>"
root.inner_end => "</a>"

root.childs.begin() is the node for the <b> tag; I'll call it b here for now.

b.tag_name => "<b>"
b.inner_begin => "1"
b.inner_end => "</b>"

With this layout, implementing innerText access only needs:

std::string str;
for (auto i = elem.inner_begin; i != elem.inner_end; ++i)
    str += *i; // append each piece in the node's range
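
What the list buys over offsets is that std::list iterators stay valid when other elements are inserted or erased, so a node's range does not have to be recomputed after an edit. Here is a small sketch of that, using a cut-down node that keeps only the three iterators; mini_node, piece_list, inner_text and bind_sample are names made up for this illustration.

```cpp
#include <cassert>
#include <list>
#include <string>

using piece_list = std::list<std::string>;

// Cut-down node for illustration: only the iterators into the piece list.
struct mini_node {
    piece_list::iterator tag_name;
    piece_list::iterator inner_begin, inner_end; // half-open range [begin, end)
};

// Concatenate every piece between inner_begin and inner_end.
std::string inner_text(const mini_node& n) {
    std::string str;
    for (auto i = n.inner_begin; i != n.inner_end; ++i)
        str += *i;
    return str;
}

// Point the nodes for <a> and <b> at the pieces of <a><b>1</b></a>.
void bind_sample(piece_list& docs, mini_node& a, mini_node& b) {
    assert(docs.size() == 5); // expects {"<a>", "<b>", "1", "</b>", "</a>"}
    auto it = docs.begin();
    a.tag_name = it;                   // "<a>"
    b.tag_name = a.inner_begin = ++it; // "<b>"
    b.inner_begin = ++it;              // "1"
    b.inner_end = ++it;                // "</b>"
    a.inner_end = ++it;                // "</a>"
}
```

After bind_sample, inserting a new piece with docs.insert(b.inner_end, std::string("2")) makes inner_text(b) return "12" and inner_text(a) return "<b>12</b>", without touching either node.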

The first step does not seem to hide any pits; I hope this is the right direction. Next, let's look at a slightly more complicated document:

<a attr1='1' attr2 = "2">
    <b attr1='1' attr2 = "2">xxx</b>
</a>

Once tags carry attributes, the situation becomes more complicated.

First, an opening tag now contains attributes as well as the tag name. If it is split up, there may be many small pieces, and storing each of them in its own std::string becomes a problem.

Then parser performance drops as well, and the later string concatenation for innerText is affected too.

Also, is it necessary to introduce separate containers to store tag names and attributes?

std::map<std::string, std::list<xnode*>> can record the tag names and at the same time serve as the per-tag index for things like getElementsByTagName.

Attribute names are usually fixed by definition, effectively constants, so the chance of reuse is high. So should they be stored as std::set<std::string>?

Attribute values usually vary, and vary a lot, so handling them the same way as attribute names does not seem very suitable. Still, some value strings are likely to repeat many times, for example true and false.

Therefore attribute values should be designed as std::map<std::string, unsigned int>, where the mapped value is a reference count and the entry is erased when the count reaches 0. Emm... nobody is crazy enough to parse a document with 4,000,000,000 nodes, so unsigned int is enough.
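
The reference-counted value table could work roughly as follows. This is only a sketch of the idea in the paragraph above; value_pool, acquire and release are hypothetical names, not part of the design so far.

```cpp
#include <cassert>
#include <map>
#include <string>

using value_pool = std::map<std::string, unsigned int>;

// Take a reference to `value`, creating the entry on first use.
value_pool::iterator acquire(value_pool& pool, const std::string& value) {
    auto it = pool.emplace(value, 0u).first; // leaves an existing entry untouched
    ++it->second;
    return it;
}

// Drop one reference; erase the entry once nothing uses the value any more.
void release(value_pool& pool, value_pool::iterator it) {
    assert(it->second > 0);
    if (--it->second == 0)
        pool.erase(it);
}
```

An attribute would keep the iterator returned by acquire and hand it back to release when the attribute is removed or its value changes. Two attributes with the value "true" then share one stored string with a count of 2.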

 

So, having thought this far, the data structure for the document source roughly takes shape:

struct xsource{
    std::list<std::string> docs;
    std::map<std::string, std::list<xnode*>> tags;
    std::set<std::string> attr_names;
    std::map<std::string, unsigned int> attr_values;
};

 

The xnode structure changes accordingly:

struct xattr{
    std::set<std::string>::iterator name; // entry in xsource.attr_names
    std::map<std::string, unsigned int>::iterator value; // entry in xsource.attr_values
};

struct xnode{
    std::map<std::string, std::list<xnode*>>::iterator tag; // entry in xsource.tags
    std::list<xnode*>::iterator itag; // this node's position in tag->second; used on deletion to remove the node's pointer from xsource.tags
    std::list<xattr> attrs;
    std::list<std::string>::iterator inner_begin, inner_end;
    std::list<xnode> childs;
    xnode *parent;
    std::list<xnode>::iterator self;
};
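
To check that tag and itag are enough to keep the tag index consistent, here is a sketch of registering and unregistering a node. It uses a cut-down node with only those two members; idx_node, tag_index, index_node and unindex_node are hypothetical names for this illustration.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <string>

struct idx_node; // cut-down stand-in for xnode
using tag_index = std::map<std::string, std::list<idx_node*>>;

struct idx_node {
    tag_index::iterator tag;             // entry for this node's tag name
    std::list<idx_node*>::iterator itag; // this node's slot in tag->second
};

// Add the node to the index under `name` and remember both positions.
void index_node(tag_index& tags, idx_node& n, const std::string& name) {
    n.tag = tags.emplace(name, std::list<idx_node*>{}).first;
    n.itag = n.tag->second.insert(n.tag->second.end(), &n);
}

// Remove the node from the index without searching the list;
// drop the map entry once no node uses the tag name any more.
void unindex_node(tag_index& tags, idx_node& n) {
    assert(n.tag != tags.end());
    n.tag->second.erase(n.itag);
    if (n.tag->second.empty())
        tags.erase(n.tag);
    n.tag = tags.end();
}
```

A getElementsByTagName-style lookup is then just tags.find(name), and deleting a node never has to scan the list of nodes that share its tag name.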

That is as far as my thinking goes tonight. I'll first try a preliminary implementation along these lines and see how it goes.

 

To be continued ...

 

Origin www.cnblogs.com/babypapa/p/11785051.html