vChain: Enabling Verifiable Boolean Range Queries over Blockchain Databases (SIGMOD 2019)


Abstract

Blockchain has been in the spotlight lately due to the boom in cryptocurrencies and decentralized applications. There is an increasing need to query data stored in blockchain databases. To ensure query integrity, users can maintain the entire blockchain database and query the data locally. However, due to the massive data volume and considerable maintenance costs of blockchains, this approach is uneconomical, if not infeasible. In this paper, we take the first step towards investigating the problem of verifiable query processing on blockchain databases. We propose a new framework, called vChain, that offloads storage and computation costs from users and employs verifiable queries to guarantee result integrity. To support verifiable Boolean range queries, we propose an accumulator-based authenticated data structure that enables dynamic aggregation of arbitrary query attributes. Two new indexes are further developed to aggregate intra-block and inter-block data records for efficient query verification. We also propose an inverted prefix tree structure to accelerate the processing of a large number of subscription queries simultaneously. Security analysis and empirical studies validate the robustness and practicality of the proposed techniques.


1 Introduction

Blockchain technology has gained overwhelming momentum in recent years due to the success of cryptocurrencies such as Bitcoin [1] and Ethereum [2]. A blockchain is an append-only data structure that is replicated among peers in a network. Although peers in the network may not trust each other, the blockchain ensures data integrity in two ways. First, with the support of hash chains, the data stored on the blockchain is immutable. Second, thanks to its consensus protocol, the blockchain guarantees that all peers maintain the same copy of the data. These cryptographically guaranteed security mechanisms, coupled with the decentralization and provenance features of blockchain, make it a promising technology to revolutionize database systems [3, 4, 5, 6, 7].

From a database perspective, a blockchain can be seen as a database that stores a large number of timestamped data records. With the widespread adoption of blockchain in data-intensive applications such as finance, supply chain, and intellectual property management, there is an increasing need for users to query the data stored in blockchain databases. For example, in the Bitcoin network, users may wish to find transactions that satisfy various range selection predicates, such as "transaction fee ≥ $50" and "$0.99 million ≤ total output ≤ $1.01 million" [8]. In a blockchain-based patent management system, users can use Boolean operators to search for keyword combinations in patent abstracts, such as "blockchain" ∧ ("query" ∨ "search") [9]. Many companies, including database giants IBM, Oracle, and SAP, as well as startups such as FlureeDB [10], BigchainDB [11], and SwarmDB [12], are developing blockchain database solutions that support SQL-like queries. However, all of these solutions assume the existence of a trusted party that faithfully executes user queries over materialized views of the blockchain database. Such trusted parties may not always exist, and without them the integrity of query results cannot be guaranteed. Query processing with integrity guarantees remains an unexplored problem in blockchain research.

In a typical blockchain network [1, 2], as shown in Figure 1, there are three types of nodes: full nodes, miners, and light nodes. Full nodes store all data in the blockchain, including block headers and data records. Miners are full nodes with powerful computing resources that are responsible for constructing consensus proofs (such as nonces in the Bitcoin blockchain). Light nodes only store block headers, which include the consensus proof and the cryptographic hash of the block; data records are not stored on light nodes.
To ensure the integrity of blockchain database queries, a query user can join the blockchain network as a full node. The user can then download and validate the entire database and process queries locally without compromising query integrity. However, maintaining a full copy of the entire database may be too costly for average users, as it requires significant storage, computing, and bandwidth resources. For example, the minimum requirements for running a Bitcoin full node include 200GB of free disk space, an unmetered broadband connection with an upload speed of at least 50KB per second, and 6 hours of daily runtime [13]. To cater to query users with limited resources, especially mobile users, a more attractive alternative is to delegate storage and query services to a powerful full node, while query users only act as light nodes that receive results. However, ensuring the integrity of query results remains a challenge, because full nodes cannot be trusted, an inherent assumption of the blockchain.

To address the above query integrity issue, in this paper we propose a new framework, called vChain, which employs verifiable query processing to guarantee result integrity. More specifically, we augment each block with some additional authenticated data structure (ADS), based on which an (untrusted) full node can construct a verification object (VO) and return it for the user to verify the query results of each block. The communication between the query user (light node) and the full node is shown in Figure 1, where Q denotes the query request and R denotes the result set.

It is worth noting that the vChain framework is inspired by query authentication techniques developed for outsourced databases [14, 15, 16, 17, 18]. However, several key differences make traditional techniques unsuitable for blockchain databases. First, traditional techniques rely on the data owner to sign the ADS with a private key. In contrast, there are no data owners in a blockchain network. Only miners can append new data to the blockchain by constructing consensus proofs according to the consensus protocol; however, they cannot act as data owners, because no single miner can be trusted to hold the private key and sign the ADS. Second, a traditional ADS is built over a fixed data set and cannot effectively adapt to a blockchain database with unbounded data. Third, in traditional outsourced databases, new ADSs can always be generated and appended as needed to support more queries involving different attribute sets. This is difficult under the immutability of the blockchain, where a one-size-fits-all ADS that supports dynamic query attributes is better suited.

Clearly, the design of the ADS is a key issue for the vChain framework. To address this issue, this paper focuses on Boolean range queries, which, as mentioned earlier, are common in blockchain applications [8, 9]. We propose a novel accumulator-based ADS scheme that enables dynamic aggregation of arbitrary query attributes, including numerical and set-valued attributes. This newly designed ADS is independent of the consensus protocol and thus compatible with current blockchain technology. Based on it, an efficient verifiable query processing algorithm is developed. We also propose two authenticated index structures, for intra-block data and inter-block data respectively, to enable batch verification. To support large-scale subscription queries, we further propose a query indexing scheme that groups similar query requests. In summary, our contributions in this paper are as follows:

• To the best of our knowledge, this is the first verifiable query processing effort to leverage built-in ADS for query integrity of blockchain databases.
• We propose a new vChain framework, along with a new ADS scheme and two index structures that can aggregate intra-block and inter-block data records for efficient query processing and validation.
• We develop a new query index that can handle a large number of subscription queries simultaneously.
• We conduct security analysis and empirical studies to validate the proposed techniques. We also address practical implementation issues.

The remainder of this paper is organized as follows. Section 2 reviews existing research on blockchains and verifiable query processing. Section 3 presents the formal problem definition, followed by cryptographic primitives in Section 4. Section 5 presents our basic solution, which is then improved by two index structures designed in Section 6. Section 7 discusses verifiable subscription queries. Section 8 presents the security analysis. Section 9 presents the experimental results. Finally, we conclude our paper in Section 10.


2 Related Work

In this section, we briefly review related research and discuss related techniques.
Blockchain. Since the launch of the Bitcoin cryptocurrency, blockchain technology has received significant attention from academia and industry [1, 2, 5]. A blockchain is essentially a special form of Merkle hash tree (MHT) [19], built as a chain of blocks. As shown in Figure 2, each block stores a list of transaction records and an MHT built on top of them. The header of each block consists of four parts: (i) PreBkHash, the hash of the previous block; (ii) TS, the timestamp at which the block was created; (iii) ConsProof, which is constructed by miners and guarantees the consensus of the block; (iv) MerkleRoot, the root hash of the MHT. ConsProof is usually computed from PreBkHash and MerkleRoot, and varies with the consensus protocol. In the widely used Proof-of-Work (PoW) consensus protocol, ConsProof is a nonce computed by miners such that:
hash(PreBkHash | TS | MerkleRoot | nonce) ≤ Z
where Z corresponds to the mining difficulty. Once a miner finds such a nonce, it packages the new block and broadcasts it to the whole network. Other miners verify the transaction records and the nonce of the new block and, once verified, append it to the blockchain.
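For concreteness, the following minimal Python sketch searches for a nonce under the PoW condition above, using SHA-256; the field layout, the "|" separator, and the very easy target Z are illustrative assumptions, not the actual Bitcoin block encoding.

```python
import hashlib

def find_nonce(prev_bk_hash: str, ts: int, merkle_root: str, z: int) -> int:
    """Search for a nonce such that hash(PreBkHash | TS | MerkleRoot | nonce) <= Z."""
    nonce = 0
    while True:
        header = f"{prev_bk_hash}|{ts}|{merkle_root}|{nonce}".encode()
        digest = int.from_bytes(hashlib.sha256(header).digest(), "big")
        if digest <= z:           # smaller Z means a harder puzzle (higher mining difficulty)
            return nonce
        nonce += 1

# Toy run with a very easy target so the loop terminates almost immediately.
Z = 2 ** 252
print(find_nonce("prev_hash", 1546300800, "merkle_root", Z))
```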

A great deal of effort has been made to address various issues of blockchain systems, including system protocols [20, 21], consensus algorithms [22, 23], security [24, 25], storage [7], and performance benchmarking [4]. Recently, major database vendors including IBM [26], Oracle [27], and SAP [28] have integrated blockchain with their database management systems, which allows users to execute queries on the blockchain through a database front-end. Furthermore, many startups such as FlureeDB [10], BigchainDB [11], and SwarmDB [12] have been developing blockchain-based database solutions for decentralized applications. However, they typically separate query processing from the underlying blockchain storage and rely on trusted database servers to guarantee query integrity. In contrast, our proposed vChain solution builds authenticated data structures into the blockchain itself, so that even an untrusted server can provide integrity-guaranteed query services.

Verifiable query processing. Verifiable query processing techniques have been extensively studied to ensure result integrity against untrusted service providers (e.g., [14, 15, 16, 17, 18, 29]). Most existing research focuses on outsourced databases, where there are two typical approaches: circuit-based verifiable computation (VC) techniques for general queries and authenticated data structures (ADS) for specific queries. VC-based approaches (e.g., SNARKs [30]) can support arbitrary computational tasks, but at the expense of very high, sometimes impractical, overhead. Furthermore, they require an expensive preprocessing step, since both the data and the query program need to be hard-coded into the proving and verification keys. To address this issue, Ben-Sasson et al. [31] developed a variant of SNARK in which the preprocessing step depends only on an upper bound on the size of the database and the query program. Recently, Zhang et al. [29] proposed the vSQL system, which utilizes an interactive protocol to support verifiable SQL queries. However, it is limited to relational databases with a fixed schema.

In contrast, ADS-based approaches are usually more efficient because they are tailored to specific queries. Our proposed solution belongs to this category. Two types of structures are commonly used as ADSs: digital signatures and MHTs. Digital signatures authenticate the content of digital messages based on asymmetric cryptography. To support verifiable queries, each data record must be signed, which does not scale to large datasets [14]. The MHT, on the other hand, is a hierarchical tree structure [19]. Each entry in a leaf node is assigned the hash digest of a data record, and each entry in an internal node is assigned a digest derived from its child nodes. The data owner signs the root digest of the MHT, which can then be used to verify any subset of data records. The MHT has been widely applied to various index structures [15, 16, 17]. Recently, verifiable queries on set-valued data have also been studied [32, 33, 34, 35, 36].

Another closely related research direction is verifiable query processing over data streams [37, 38, 39, 40]. However, previous studies [38, 39] focus on one-shot queries that retrieve the latest version of the streaming data. The approach in [40] requires the data owner to maintain one MHT for all data records and incurs long query latency, which is unsuitable for real-time streaming services. Subscription queries over data streams are studied in [41, 42, 43]. So far, no work has considered the integrity issue of subscription queries over blockchain databases.

3 Problem Definition

As described in Section 1, this paper proposes a novel vChain framework and investigates verifiable query processing on blockchain databases. Figure 3 shows the system model of vChain, which involves three parties: (i) miners, (ii) the service provider (SP), and (iii) query users. Both miners and the SP are full nodes that maintain the entire blockchain database. Query users are light nodes that only track block headers. Miners are responsible for constructing consensus proofs and appending new blocks to the blockchain. The SP provides query services for light users.
The data stored in the blockchain can be modeled as a sequence of temporal objects {o1, o2, ..., on}. Each object oi is denoted by ⟨ti, Vi, Wi⟩, where ti is the timestamp of the object, Vi is a multi-dimensional vector representing one or more numerical attributes, and Wi is a set-valued attribute. To enable verifiable query processing, an authenticated data structure (ADS, detailed in Sections 5-7) is built by the miners and embedded into each block. We consider two forms of Boolean range queries: (historical) time window queries and subscription queries.

Time window query. A user may wish to search for records that occur within a certain time period. In this case, a time window query can be issued. Specifically, a time window query has the form q = ⟨[ts, te], [α, β], ϒ⟩, where [ts, te] is the time window selection predicate, [α, β] is a multi-dimensional range selection predicate over the numerical attributes, and ϒ is a monotone Boolean function over the set-valued attribute. As a result, the SP returns all objects {oi = ⟨ti, Vi, Wi⟩ | ti ∈ [ts, te] ∧ Vi ∈ [α, β] ∧ ϒ(Wi) = 1}. For simplicity, we assume that ϒ is in conjunctive normal form (CNF).
Example 3.1. In a Bitcoin transaction search service, each object oi corresponds to a coin transfer transaction. It consists of the transfer amount stored in Vi and a set of sender/receiver addresses stored in Wi. A user can issue the query q = ⟨[2018-05, 2018-06], [10, +∞], send:1FFYc ∧ receive:2DAAf⟩ to find all transactions that occurred between May and June 2018 with a transfer amount of at least 10 and that are associated with the addresses "send:1FFYc" and "receive:2DAAf".
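For intuition, here is a small, hypothetical sketch of how the time window query predicate could be evaluated against a single object; the function and variable names are ours, and ϒ is represented as a CNF list of keyword sets.

```python
from typing import List, Set, Tuple

# CNF Boolean function: a list of clauses; OR within a clause, AND across clauses.
CNF = List[Set[str]]

def matches(obj: Tuple[int, List[float], Set[str]],
            ts: int, te: int,
            alpha: List[float], beta: List[float],
            upsilon: CNF) -> bool:
    """Check whether oi = <ti, Vi, Wi> satisfies q = <[ts, te], [alpha, beta], upsilon>."""
    ti, vi, wi = obj
    in_time = ts <= ti <= te
    in_range = all(a <= v <= b for v, a, b in zip(vi, alpha, beta))
    cnf_ok = all(clause & wi for clause in upsilon)   # every clause shares a keyword with Wi
    return in_time and in_range and cnf_ok

# A query in the spirit of Example 3.1 (timestamps encoded as YYYYMMDD integers).
obj = (20180515, [12.0], {"send:1FFYc", "receive:2DAAf"})
print(matches(obj, 20180501, 20180630, [10.0], [float("inf")],
              [{"send:1FFYc"}, {"receive:2DAAf"}]))   # True
```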

Subscription query. In addition to time window queries, users can also register their interest through subscription queries. Specifically, a subscription query has the form q = ⟨−, [α, β], ϒ⟩, where [α, β] and ϒ are the same as in a time window query. In turn, the SP continuously returns all objects {oi = ⟨ti, Vi, Wi⟩ | Vi ∈ [α, β] ∧ ϒ(Wi) = 1} until the query is deregistered.
Example 3.2. In a blockchain-based car rental system, each rental object consists of the rental price stored in Vi and a set of text keywords stored in Wi. A user can subscribe with the query q = ⟨−, [200, 250], "Sedan" ∧ ("Benz" ∨ "BMW")⟩ to receive all rental messages whose price is in the range [200, 250] and which contain the keyword "Sedan" and either "Benz" or "BMW".

Additional examples of time window queries and subscription queries can be found in Figure 3.

Threat model. We consider the SP, as an untrusted peer in the blockchain network, to be the potential adversary. Due to various issues such as program glitches, security vulnerabilities, and commercial interests, the SP may return tampered or incomplete query results, thereby violating the expected security of the blockchain. To address this threat, we employ verifiable query processing, which enables the SP to prove the integrity of query results. Specifically, during query processing, the SP consults the ADS embedded in the blockchain and constructs a verification object (VO) containing the verification information of the results. The VO is returned to the user along with the results. Using the VO, the user can establish the soundness and completeness of the query results based on the following criteria:
• Soundness. None of the returned objects have been tampered with, and all of them satisfy the query conditions.
• Completeness. No valid result is missing with respect to the query window or subscription period.
These security notions will be formalized in the security analysis in Section 8. The main challenge of this model is how to design an ADS that can be easily accommodated in the blockchain structure while allowing the VO to be constructed efficiently for both time window queries and subscription queries, with small bandwidth overhead and fast verification time. We address this challenge in the next few sections.

4 Preliminaries

This section provides some preliminary introduction to the cryptographic structures required in our algorithm design.
Cryptographic hash function. A cryptographic hash function hash(·) accepts a string of arbitrary length as input and returns a fixed-length bit string. It is collision resistant: it is computationally infeasible to find two distinct messages m1 and m2 such that hash(m1) = hash(m2). Classical cryptographic hash functions include the SHA-1, SHA-2, and SHA-3 families.
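As a tiny illustration with SHA-256 (a member of the SHA-2 family), inputs of any length map to digests of the same fixed length:

```python
import hashlib

# Inputs of any length map to 256-bit digests; finding two distinct inputs with
# the same digest (a collision) is believed to be computationally infeasible.
print(hashlib.sha256(b"m1").hexdigest())
print(hashlib.sha256(b"a much longer message m2").hexdigest())
```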
Bilinear pairing. Let G and H be two cyclic multiplicative groups of the same prime order p, and let g be a generator of G. A bilinear map is a function e : G × G → H with the following properties:
• Bilinearity: e(u^a, v^b) = e(u, v)^(ab) for any u, v ∈ G and a, b ∈ Z_p.
• Non-degeneracy: e(g, g) ≠ 1.
Bilinear pairings are used as the basic operation of multiset accumulators, as shown later in this paper.

5 Basic Solution

To enable verifiable queries in our vChain framework, a naive scheme is to build a traditional MHT as the ADS of each block and apply conventional MHT-based authentication. However, this naive approach has three major drawbacks. First, the MHT supports only the query keys on which the Merkle tree is built; to support queries involving arbitrary sets of attributes, an exponential number of MHTs would need to be built for each block. Second, the MHT is not suitable for set-valued attributes. Third, the MHTs of different blocks cannot be effectively aggregated, so inter-block optimization techniques cannot be applied. To overcome these shortcomings, in this section we propose a novel authentication technique based on a new accumulator-based ADS scheme, which converts numerical attributes into set-valued attributes and supports dynamic aggregation of arbitrary query attributes.

Below, we start by considering a single object and focus on Boolean time window queries for illustration (Sections 5.1 and 5.2). We then extend the discussion to range query conditions (Section 5.3). Batch query processing and verification of multiple objects are discussed in Section 6, and subscription queries are detailed in Section 7.

5.1 ADS Generation and Query Processing

Verifiable query processing. Given a Boolean query condition and a data object, there are only two possible outcomes: match or mismatch. The soundness of the first case can easily be verified by returning the object as a result, since its integrity is attested by the ObjectHash stored in the block header, which is available to query users on light nodes (recall Figure 3). The challenge is how to efficiently verify the second case using AttDigest. Since a CNF expression is a conjunction (AND) of clauses, each being a disjunction (OR) of conditions, we can view a Boolean function in CNF as a list of sets. For example, the query condition "Sedan" ∧ ("Benz" ∨ "BMW") is equivalent to two sets: {"Sedan"} and {"Benz", "BMW"}. Consider a mismatching object oi: {"Van", "Benz"}. It is easy to see that there exists an equivalent set (namely {"Sedan"}) whose intersection with the object's attributes is empty. Therefore, we can apply ProveDisjoint({"Van", "Benz"}, {"Sedan"}, pk) to generate a disjointness proof π as the VO of the mismatching object. The user can then retrieve AttDigesti = acc({"Van", "Benz"}) from the block header and use VerifyDisjoint(AttDigesti, acc({"Sedan"}), π, pk) to verify the mismatch. The whole process is detailed in Algorithm 1.
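The following hypothetical sketch mirrors only the witness-finding step of this logic: it scans the CNF clauses for one that is disjoint from the object's attribute set. In vChain, that clause would then be fed to ProveDisjoint/VerifyDisjoint, which is not modelled here.

```python
from typing import List, Optional, Set

CNF = List[Set[str]]   # "Sedan" AND ("Benz" OR "BMW")  ->  [{"Sedan"}, {"Benz", "BMW"}]

def mismatch_witness(attrs: Set[str], upsilon: CNF) -> Optional[Set[str]]:
    """Return a clause disjoint from the object's attribute set, if one exists.
    In vChain, this clause would be passed to ProveDisjoint(attrs, clause, pk);
    the cryptographic proof itself is not modelled here."""
    for clause in upsilon:
        if not (clause & attrs):   # empty intersection: the object cannot satisfy this clause
            return clause
    return None                    # no disjoint clause: the object matches the query

oi = {"Van", "Benz"}
q = [{"Sedan"}, {"Benz", "BMW"}]
print(mismatch_witness(oi, q))     # {'Sedan'}: the mismatch proof is built on this clause
```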

5.2 Structure of multiple accumulators

5.3 Extensions to range queries

6 batch verification

6.1 Intra-block index

6.2 Inter-block indexes

6.3 Online Batch Verification

7 Verifiable Subscription Query

7.1 Query indexes for scalable processing

7.2 Lazy authentication

8 Security Analysis

8.1 Multiset Accumulator Analysis

8.2 Query Authentication Analysis

9 Performance Evaluation

In this section, we evaluate the performance of the vChain framework on time window queries and subscription queries. Three datasets were used in the experiments:

• Foursquare (4SQ) [46]: The 4SQ dataset contains 1M data records of user check-ins. We pack the records within each 30-second interval into one block; each object is of the form ⟨timestamp, [longitude, latitude], {check-in place's keywords}⟩. On average, each record has 2 keywords.

• Weather (WX): The WX dataset contains 1.5 million hourly weather records for 36 cities in the US, Canada, and Israel from 2012 to 2017. Each record contains seven numerical attributes (such as humidity and temperature) and one weather description attribute with an average of 2 keywords. Records within the same hourly interval are packed into one block.

• Ethereum (ETH): The ETH transaction dataset was extracted from the Ethereum blockchain from January 15, 2017 to January 30, 2017. It contains 90,000 blocks and 1.12 million transaction records. Each transaction is of the form ⟨timestamp, amount, {addresses}⟩, where amount is the amount of ether transferred and {addresses} are the sender and receiver addresses. Most transactions have two addresses.

Note that the time intervals for block generation in 4SQ, WX, and ETH are roughly 30 seconds, 1 hour, and 15 seconds, respectively.

The query user was set up in a single thread on a commodity laptop with an Intel Core i5 CPU and 8GB RAM running CentOS 7. The SP and miner were set up on an x64 blade server with dual Intel Xeon X5650 2.67GHz CPUs and 32GB RAM, also running CentOS 7. The experiments are written in C++ using the following libraries: MCL for bilinear pairing computation, Flint for modular arithmetic operations, Crypto++ for 160-bit SHA-1 hashing, and OpenMP for parallel computation. In addition, the SP (the service provider, a full node) runs 24 hyper-threads to speed up query processing.

To evaluate the performance of verifiable queries in vChain, we mainly use three metrics:
(i) query processing cost, in terms of the SP's (full node's) CPU time;
(ii) result verification cost, in terms of the user's CPU time; and
(iii) VO size, i.e., the size of the VO transferred from the SP to the user.
For each experiment, we randomly generate 20 queries and report the average result. By default, we set the selectivity of the numerical range to 10% (for 4SQ and WX) and 50% (for ETH), and use disjunctive Boolean functions of size 3 (for 4SQ and WX) and 9 (for ETH). For WX, each range predicate involves two attributes.

9.1 Setup costs

Table 1 reports the setup costs of the miners, including ADS construction time and ADS size. Three methods are compared in our experiments: (i) nil: no index is used; (ii) intra: only the intra-block index is used; (iii) both: both the intra-block and inter-block indexes are used, where the size of the inter-block SkipList is set to 5. Each method uses the two different accumulator constructions (labeled acc1 and acc2) described in Section 5.2. Therefore, a total of six schemes are evaluated in each experiment. Unsurprisingly, the ADS construction time of both is generally longer than that of nil and intra, but still within 2s in most cases. Furthermore, acc2 significantly reduces the construction time of both compared with acc1, because it supports incremental aggregation, so that the index of the previous block can be reused when building the inter-block index. The ADS size, which is independent of the accumulator used, ranges from 2.6KB to 11.1KB per block for different indexes and datasets.

We also measure the space required for users running light nodes to maintain the block headers. For nil and intra, the size of each block header is 800 bits, regardless of the dataset or accumulator. For both, the block header size slightly increases to 960 bits due to the inter-block index.

9.2 Time window query performance

To evaluate the performance of time window queries, we vary the query window from 2 hours to 10 hours for 4SQ and ETH, and from 20 hours to 100 hours for WX. The results for the three datasets are shown in Figures 9-11, respectively. We make several interesting observations. First, as expected, the indexes improve performance significantly on almost all metrics. Especially for the 4SQ and ETH datasets, the performance with indexes is at least 2 times better than using the same accumulators without any index. This is because the objects in these two datasets share less similarity and thus benefit more from index-based pruning. Second, the cost of the index-based schemes increases only sub-linearly with larger query windows. This is especially true for the user CPU time of the index-based schemes using acc2, which supports batch verification of mismatches (see Section 6.3). Third, comparing intra and both, both is always no worse than intra, except for the SP CPU time on the 4SQ dataset. On the one hand, this demonstrates the effectiveness of the inter-block index. On the other hand, the reason both is worse than intra in SP CPU time is mainly that, in the inter-block index-based scheme, larger multisets are used as input to the set-disjointness proofs, which increases the SP CPU time. More insight into this is provided in Appendix D.3, where we examine the effect of the SkipList size. For the ETH dataset, the largest improvement of both over intra is observed. The reason is as follows: compared with 4SQ, the shared similarity between objects in ETH is lower; compared with WX, ETH contains fewer objects in each block. In both cases, more performance improvement can be obtained by using the skip lists of the inter-block index.

9.3 Subscription query performance

We next evaluate the performance of subscription queries. First, we examine the query processing time of the SP with and without the IP-Tree (denoted as ip and nip) under the default setting, with both the intra-block and inter-block indexes enabled. We randomly generate a varying number of queries and set the default subscription period to 2 hours for 4SQ and ETH, and 20 hours for WX. As shown in Figure 12, the IP-Tree reduces the SP overhead by at least 50% in all test cases. The performance gain is more pronounced on the ETH dataset (Figure 12(c)) due to its sparser data distribution.
To compare real-time and lazy authentication, we consider two real-time schemes (using acc1 and acc2) and one lazy scheme (using acc2 only, since acc1 does not support aggregation of accumulative values and proofs). We vary the subscription period from 2 hours to 10 hours for 4SQ and ETH, and from 20 hours to 100 hours for WX. Figures 13-15 show the results. Clearly, the lazy scheme performs better than the real-time schemes in terms of user CPU time. Furthermore, the CPU time and VO size of the lazy scheme increase only sub-linearly with the subscription period. This is because the lazy scheme can aggregate the proofs of mismatched objects across blocks. In contrast, the real-time schemes compute all proofs as soon as a new block arrives, resulting in worse performance. In terms of SP CPU time, the lazy scheme generally performs worse than the real-time scheme using the same accumulator, since it trades extra SP computation for aggregating the mismatch proofs.

10 Conclusion

In this paper, we study the problem of verifiable query processing on blockchain databases for the first time in the literature. We propose the vChain framework to ensure the integrity of Boolean range queries for lightweight users. We develop a novel accumulator-based ADS scheme that converts numerical attributes into set-valued attributes, enabling dynamic aggregation of arbitrary query attributes. On this basis, two data indexes, a tree-based intra-block index and a skip-list-based inter-block index, as well as a prefix-tree-based subscription query index, are designed, together with a series of optimizations. The robustness and practicality of the proposed techniques are validated through security analysis and empirical results.
This paper opens a new direction for blockchain research. There are many interesting research questions worthy of further study, for example, how to support more complex analytical queries; how to take advantage of modern hardware (such as multi-core and many-core) to scale performance; and how to address privacy concerns in query processing.

vChain+: Optimizing Verifiable Blockchain Boolean Range Queries (Technical Report)

Abstract

Blockchain has recently attracted a lot of attention due to the success of cryptocurrencies and decentralized applications. With its immutability and tamper-proof properties, it can be considered a promising secure database solution. To address the search needs over blockchain databases, the earlier work vChain proposes a verifiable processing framework that ensures query integrity without requiring users to maintain a full copy of the blockchain database. However, it suffers from several limitations, including worst-case linear-scan search performance and impractical public key management. In this paper, we propose a new searchable blockchain system, vChain+, which supports efficient verifiable Boolean range queries with additional functionality. Specifically, we propose a sliding window accumulator index that enables efficient query processing even in the worst case. We also design an object registration index that enables practical public key management without compromising the security guarantees. To support richer queries, we employ optimal tree-based indexes over the keywords and numerical attributes of data objects. Several optimizations are also proposed to further improve query performance. Security analysis and empirical studies validate the robustness and performance improvement of the proposed system. Compared with vChain, the query performance of vChain+ is improved by up to 913×.

I. Introduction

In recent years, blockchain has received a lot of attention [1]-[3] due to the great success of decentralized applications in various fields such as cryptocurrencies, healthcare, and supply chain management. A blockchain is an append-only ledger built on incoming transactions agreed upon by a trustless network of nodes. Thanks to hash chains and distributed consensus protocols, blockchains are immutable and tamper-proof. In a typical blockchain network, there are three types of nodes: full nodes, miners, and light nodes, as shown in Figure 1. Full nodes maintain a complete copy of the blockchain data, including block headers and complete block states. A miner is also a full node, but with the additional responsibility of generating new blocks. Light nodes, on the other hand, do not maintain the entire blockchain; they only store the block headers, which contain the consensus proofs and block state digests. Despite their small size, the block headers provide enough information to verify the integrity of the blocks.

The unique properties of blockchain make it a promising solution for secure databases, especially in decentralized environments. Therefore, there is a growing need to query the data stored in blockchain databases. For example, in the Bitcoin network, a user may wish to find all transactions that transfer an amount between $1 and $10 within a certain time interval, or that are associated with some specific sender and receiver addresses. Some database companies, such as IBM and Oracle, provide searchable blockchain database solutions by materializing views of the blockchain data in traditional centralized databases. However, such a design is not desirable for decentralized applications: a centralized party, which could be malicious or compromised, cannot guarantee the integrity of query execution. Alternatively, users can maintain a full copy of the entire blockchain database and query the data locally. However, this is impractical for ordinary users, as it requires massive storage, computing, and bandwidth resources.

To address the above issues, Xu et al. [4] proposed the vChain framework, which supports verifiable Boolean range queries on blockchain databases. As shown in Figure 1, a query user in vChain only needs to act as a light node; the query is instead outsourced to a full node in the blockchain network, which acts as a service provider (SP). Although the SP may not be trusted, the user can still verify the integrity of the query results by inspecting an additional verification object (VO). The VO is computed by the SP with the help of a well-crafted authenticated data structure (ADS) embedded in the block header. We briefly discuss the basic design of the vChain framework and the challenges that limit its usefulness below.

A. vChain and Its Limitations

In vChain, each block header embeds a specially designed ADS, AttDigest (as shown in Figure 2). AttDigest is computed by a cryptographic set accumulator as a constant-size digest representing a set of data objects. It can be used to efficiently prove that the data objects in a block do not match the query condition via set disjointness operations. For example, if the object oi in blocki has two keywords {"A", "B"}, the corresponding AttDigest is computed as AttDigest = acc({"A", "B"}), where acc(·) computes the accumulative value of a set. When the user issues q = "B" ∧ "C", blocki does not match q because {"C"} ∩ {"A", "B"} = ∅. Therefore, the SP computes a set disjointness proof π∅ and sends VO = {π∅, "C"} to the user. Based on this information, the user can use π∅ and the AttDigest in the block header to establish that blocki does not match the query condition "C". To process multiple unmatched blocks in batches for better performance, vChain also proposes an inter-block index, which is a skip list used to aggregate data objects across blocks. For each skip, an accumulative value is computed over the objects in the skipped blocks. If a query does not match the aggregated blocks due to the same mismatching query condition, a single mismatch proof can be generated to skip those blocks, reducing the query cost. Range queries in vChain are handled by converting numerical attributes into set-valued attributes with the help of a prefix tree and following a similar query processing procedure.
While vChain is the first to support verifiable Boolean range queries over blockchain databases, it still has several limitations that restrict its usefulness. The first is that, in the worst case, the inter-block index cannot help aggregate the proofs of mismatched blocks, degrading the query into a linear scan. For example, suppose q = "A" ∧ "B" and three consecutive blocks, whose objects have keywords o1 = {"A", "C"}, o2 = {"B", "D"}, o3 = {"A", "E"}, are aggregated, so that the accumulative value in the inter-block index is computed over S = {"A", "B", "C", "D", "E"}. In this case, the inter-block index does not help, because S satisfies q. This example shows that, in the worst case, vChain has to query each block individually, since the inter-block index cannot aggregate multiple mismatched blocks that mismatch for different reasons. With this observation, we evaluate vChain on the 4SQ dataset [5] to measure the utilization of the inter-block index for mismatched blocks. Figure 3 shows that in almost 80% of the cases the inter-block index does not take effect (i.e., the skip length is 0), which is consistent with the above analysis. The second limitation of vChain is the practicality of its public key management. Due to the nature of cryptographic accumulators, the public key size is determined by the largest possible attribute value in the system, which is 2^256 if data attributes are encoded with a 256-bit hash. To circumvent this problem, vChain proposes to introduce a trusted oracle to dynamically generate public keys. However, such an oracle may not exist in a decentralized environment, making vChain difficult to deploy in real-life applications. Last but not least, since vChain converts numerical attributes into set-valued attributes, it can only support integers and fixed-point numbers, which limits its applicability.

B. Our Contributions

To address the limitations of vChain, we propose a new searchable blockchain system, vChain+, which supports efficient verifiable Boolean range queries through a novel ADS design that is more efficient, practical, and expressive. Instead of proving mismatches block by block, we propose a novel sliding-window accumulator design for the ADS embedded in each block. Specifically, for each block, we build a sliding window accumulator (SWA) index over the data objects in the most recent k blocks, where k is the sliding window size. With this design, a historical time window query q = [ts, te] is first divided into multiple subqueries, each with a time window of size k. Each subquery can then be efficiently processed and verified using the SWA index in the corresponding block.
The main improvement of the SWA index comes from using the optimal index for each query type (for example, tries for keyword queries and B+-trees for range queries). For the above case of three consecutive blocks with k = 3, a trie-based SWA index can be constructed over the keywords "A", "B", "C", "D", "E". During query processing, we first search the SWA index to obtain the object sets for keywords "A" and "B", which are {o1, o3} and {o2}, respectively. Then, a set intersection proof π∩ is computed using the accumulative values of the two object sets to prove that the result is ∅. Therefore, the SWA index helps aggregate the proofs across blocks and alleviates the inter-block indexing problem of vChain.
In addition to the SWA index, we also address the practical problem of public key management by introducing an object registration index. Note that the public key size of the cryptographic set accumulator depends on the size of the universe of the input set elements. Since the accumulators in our SWA index are built over data objects (cf. keywords in vChain), we register and index each data object with a small integer ID to bound the universe size and thus the public key size. Users can use this index to retrieve the final query results from the corresponding IDs with integrity guaranteed. Moreover, unlike vChain, which converts numerical attributes into set-valued attributes, we use a B+-tree to support numerical range queries over floating-point numbers. We also support arbitrary Boolean queries with multiple keywords (cf. the monotone Boolean queries supported in vChain).

Furthermore, we propose some optimizations to further improve system performance. We propose to build multiple SWA indexes with different sliding window sizes for each block, so that the SP can choose the best one according to the query conditions. At the same time, since a query may involve a series of verifiable set operations, we adopt an optimal query plan to reduce the computational overhead of the cryptographic set accumulator. We also propose to prune unnecessary set operations based on empty sets. Both security analysis and empirical studies validate the proposed methods. Experimental results show that, compared with the two accumulator constructions of vChain [4], the query performance of vChain+ improves by up to 913× and 1098×, respectively.

The remainder of this paper is organized as follows. Section II introduces the formal problem formulation, followed by some preliminary knowledge of the cryptographic building blocks in Section III. Section IV introduces the handling of verifiable Boolean queries, which are then extended to richer query types in Section V. Section VI presents several optimization techniques, and the security analysis is presented in Section VII. Section VIII presents the experimental results. Section IX discusses related work. Finally, we conclude our paper in Section X.

II. Problem Definition

As mentioned in Section I, vChain+ follows the same system model as vChain [4], but proposes a novel ADS design for better query processing efficiency and functionality. The SP is a full node of the blockchain and provides verifiable query services. Users are light nodes and only maintain block headers for verification. Miners, as full nodes, are responsible for appending new blocks to the blockchain and for building the specially designed sliding window accumulator (SWA) index in each block to facilitate verifiable queries. With the help of the SWA index, the SP returns the results along with an additional verification object (VO) for result integrity verification (as shown in Figure 1).

Data objects in the blockchain are modeled as tuples of the form oi = ⟨ti, vi, Wi⟩, where ti is the timestamp of the object, vi represents the numerical attribute, and Wi is the keyword set of the object. In this paper, we focus on verifiable historical Boolean range queries over a specified time window. A query has the form Q = ⟨[ts, te], [α, β], Υ⟩, where [ts, te] is the time window predicate, [α, β] is the numerical range predicate, and Υ is an arbitrary Boolean function over the object keyword set. Unlike in vChain, Υ is not limited to monotone Boolean functions; it also supports the ¬ (NOT), ∧ (AND), and ∨ (OR) operators and is thus more expressive. Given a query, the SP returns all data objects satisfying the query condition, i.e., {oi = ⟨ti, vi, Wi⟩ | ti ∈ [ts, te] ∧ vi ∈ [α, β] ∧ Υ(Wi) = 1}. For example, in the context of Bitcoin transaction data, a user might issue the query q = ⟨[2021-10, 2021-11], [10, 20], send:2AC0 ∧ ¬receive:3E7F⟩ to find all transactions that occurred between October and November 2021, with a transfer amount between 10 and 20, that are associated with sender 2AC0 but not with receiver 3E7F.

Threat model. Similar to vChain [4], we assume that the SP is untrusted and may return tampered or incomplete results due to various reasons such as commercial dishonesty or security breaches. On the other hand, we assume that the blockchain functions correctly, i.e., the majority of the miners are honest and the blockchain network is strongly synchronized. Furthermore, we assume that users are trusted and faithfully follow the protocol during query verification. Specifically, with the help of the VO generated by the SP, users can verify the soundness and completeness of the results. Soundness means that all returned results come from the blockchain database and satisfy the query conditions. Completeness means that no valid result with respect to the query condition is missing.
The goal of vChain+ is to design a novel ADS that helps the system achieve better query performance, more practical public key management, and more flexible query types. We present designs that meet these requirements in the following sections.

III. Preliminaries

This section introduces the cryptographic building blocks used in the proposed algorithms.
Cryptographic hash function: A cryptographic hash function H(·) is an algorithm that takes a message m of arbitrary length as input and outputs a fixed-length hash digest H(m). It has an important property, collision resistance, which states that the probability of a PPT adversary finding two messages m1 ≠ m2 such that H(m1) = H(m2) is negligible.

Merkle Hash Tree [6]: A Merkle hash tree (MHT) is a tree structure used to efficiently authenticate a set of data objects. Figure 4 shows an example MHT with eight objects. In short, the MHT is a binary hash tree constructed bottom-up. Each leaf node stores the hash value of an indexed object, and each internal node stores a hash computed over its two children (e.g., h6 = H(h3||h4), where "||" denotes concatenation). Thanks to the collision-resistant hash function and the hierarchical structure, the root hash of the MHT (h7 in Figure 4) can be used to authenticate the indexed data. For example, for a range query [6, 25], the result is {8, 20} with the corresponding proof {5, 31, h6} (shown as shaded nodes in Figure 4). The result can be verified by reconstructing the root hash from the result and the proof and comparing it with the signed root hash: if they match, the result has not been tampered with, while the boundary values 5 and 31 in the proof ensure the completeness of the result.
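Below is a minimal, self-contained sketch of MHT construction and proof verification; the helper names and leaf values are ours, chosen to resemble (but not exactly reproduce) the Figure 4 example.

```python
import hashlib
from typing import List, Tuple

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def build_mht(values: List[bytes]) -> List[List[bytes]]:
    """Bottom-up construction; levels[0] holds the leaf hashes, levels[-1][0] is the root."""
    levels = [[H(v) for v in values]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([H(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def merkle_proof(levels: List[List[bytes]], idx: int) -> List[Tuple[bytes, bool]]:
    """Sibling hashes from leaf to root; the flag marks whether the sibling is on the right."""
    proof = []
    for level in levels[:-1]:
        sib = idx ^ 1
        proof.append((level[sib], sib > idx))
        idx //= 2
    return proof

def verify(value: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    h = H(value)
    for sib, sib_is_right in proof:
        h = H(h + sib) if sib_is_right else H(sib + h)
    return h == root

values = [b"2", b"5", b"8", b"20", b"31", b"40", b"43", b"47"]
levels = build_mht(values)
root = levels[-1][0]                                   # the digest that would be signed
assert verify(b"20", merkle_proof(levels, 3), root)    # result object authenticated against the root
```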

To support additional queries, the MHT has been extended to the Merkle B+-tree [7] for range queries, the Merkle R-tree [8] for spatial queries, and the Merkle Patricia Trie [2] for string search.

Cryptographic set accumulator [9]: A cryptographic set accumulator is a function that maps a set X to a constant-size digest acc(X). Similar to a cryptographic hash function, this digest authenticates the corresponding set. In addition, it supports various verifiable set operations, including intersection (denoted by ∩), union (denoted by ∪), and difference (denoted by \). These set operations can be invoked in a nested fashion and are verified against the accumulative values of the input sets. Specifically, the set accumulator scheme consists of the following probabilistic polynomial-time algorithms:

• ACC.KeyGen(1^λ, U) → pk: On input a security parameter λ and a universe U, it outputs the public key pk.
• ACC.Setup(X, pk) → acc(X): On input a set X and the public key pk, it outputs the accumulative value acc(X) of X.
• ACC.Update(acc(X), acc(Δ), pk) → acc(X+Δ): On input the accumulative value acc(X) of a set X, the accumulative value acc(Δ) of an update Δ (which may insert or delete set elements), and the public key pk, it outputs the accumulative value acc(X+Δ) of the updated set X+Δ.
• ACC.Prove(X1, X2, opt, pk) → {R, πopt}: On input two sets X1, X2, a set operation opt ∈ {∩, ∪, \}, and the public key pk, it returns the result R = opt(X1, X2) of the set operation and a proof πopt.
• ACC.Verify(acc(X1), acc(X2), opt, πopt, acc(R), pk) → {0, 1}: On input the accumulative values acc(X1) and acc(X2) of the sets X1 and X2, a proof πopt for the operation opt, the accumulative value acc(R) of the answer set R, and the public key pk, it outputs 1 if and only if R = opt(X1, X2).

In this paper, we use the state-of-the-art cryptographic set accumulator scheme proposed by Zhang et al. [9], which supports not only incremental updates but also expressive nested set operations. Another nice property of this scheme is that the proof size of any set operation is constant, and the cost of proving a sequence of nested set operations scales linearly with the number of set operations. However, its proof generation is relatively expensive, with a complexity of O(N1 · N2), where N1 and N2 are the sizes of the input sets X1 and X2, respectively. Furthermore, its proof size is relatively larger than that of the scheme used in vChain [4], in exchange for expressiveness. Meanwhile, the public key size of this scheme is O(|U|^2), where |U| is the size of the universe of input set elements. To remedy these shortcomings, we propose an object registration index that assigns each data object a bounded ID to address the public key size problem in Section IV-A, and we propose several techniques to reduce the overhead of proof generation in Section VI.
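To make the call pattern of the ACC.* interface concrete, here is a toy stand-in that models digests and proofs with plain Python sets; it provides no cryptographic guarantees, unlike the pairing-based scheme of [9], and the function names are ours.

```python
from typing import FrozenSet, Set, Tuple

# Digests are modelled by frozensets and proofs by the operands, so Verify simply
# recomputes the operation; a real scheme outputs constant-size binding values.
def acc_setup(x: Set[int]) -> FrozenSet[int]:
    return frozenset(x)

def acc_prove(x1: Set[int], x2: Set[int], opt: str) -> Tuple[Set[int], Tuple]:
    ops = {"cap": x1 & x2, "cup": x1 | x2, "diff": x1 - x2}
    return ops[opt], (frozenset(x1), frozenset(x2), opt)      # "proof" = the operands

def acc_verify(acc_x1, acc_x2, opt, proof, acc_r) -> bool:
    px1, px2, popt = proof
    if (px1, px2, popt) != (acc_x1, acc_x2, opt):
        return False
    ops = {"cap": px1 & px2, "cup": px1 | px2, "diff": px1 - px2}
    return frozenset(ops[opt]) == acc_r

x1, x2 = {2, 3, 4}, {2, 3, 5}
r, pi = acc_prove(x1, x2, "cap")
assert acc_verify(acc_setup(x1), acc_setup(x2), "cap", pi, acc_setup(r))   # R = X1 ∩ X2 = {2, 3}
```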

IV. Verifiable Boolean Query Processing

In this section, we consider verifiable Boolean queries with multiple keywords. As explained before, vChain's query processing can degenerate into a linear scan in the worst case. To address this issue, we propose a novel sliding-window accumulator index design for efficient query processing. The main idea is to build, for each block, a sliding window accumulator trie (SWA-Trie for short) over the data objects in the most recent k blocks, where k is the sliding window size. The root hash of the SWA-Trie is embedded in the block header (see Figure 5) to enable verifiable query processing. Below, we discuss the issues related to this design in detail: (i) how to manage the accumulator's public key through object registration (Section IV-A), (ii) how to efficiently maintain the SWA-Trie index (Section IV-B), (iii) how to support expressive Boolean keyword queries (Section IV-C), and (iv) how to verify the query results (Section IV-D).

A. Object Registration

As mentioned earlier, we use a cryptographic set accumulator scheme to authenticate various set operations. However, the public key size of the accumulator scheme used in our design scales quadratically with the size of the universe of input set elements. Recall that the input set elements are the data objects in each sliding window. This poses a challenge to public key management in practical applications. For example, we cannot simply use a cryptographic hash function to encode a data object as a 256-bit integer, which would result in a public key of size (2^256)^2 = 2^512. To address a similar problem, vChain proposes to introduce a trusted oracle, which holds a secret key to dynamically generate public keys [4]. However, such a solution is not ideal in the context of blockchain applications, since it is not easy to find a trusted third party in a decentralized public blockchain environment.
To properly address this issue, we propose to embed an object registration (ObjReg) index in each block of the blockchain, as shown in Figure 5. Instead of storing the data objects directly in the set accumulator, we register each data object with an ID and store the ID in the set accumulator. The ObjReg index tracks the mapping between the data objects and their IDs in the last 2k−1 blocks. Here, we enforce a maximum ID, denoted MaxID, which is the maximum possible number of data objects spanning 2k−1 blocks. Therefore, the size of the universe of the set accumulator's input elements is limited to MaxID, which bounds the size of the public key. For example, we set MaxID to 2^12 in our experiments, which limits the public key size to (2^12)^2 = 2^24. Meanwhile, this also guarantees that the data objects in any 2k−1 consecutive blocks always have different IDs. As will be shown later, our set operations only involve data objects within 2k−1 blocks; therefore, each object in any set operation is guaranteed to have a unique ID.

The ObjReg index is a complete, balanced MHT with a fixed fanout. Whenever a new data object arrives, the miner registers the object and assigns it an ID by incrementing a counter modulo MaxID. The object is then inserted into the ObjReg index based on its ID. Since the ObjReg index is a complete tree with a fixed fanout, an object's position can easily be computed by interpreting the ID as a number with the fanout as the base. Consider data object o6 in Figure 6. Since ID 6 can be interpreted as 020 in radix-3, o6 can be located by following the 1st, 3rd, and 1st child at the corresponding tree levels. With the ObjReg index and the IDs of the query results, the user can verify the query results just as with an ordinary MHT. In the example of Figure 6, where o6 is the query result, the SP returns {o6, h11, h12, h1, h2} to the user. On the user side, the root hash of the ObjReg tree is reconstructed and compared with the hash stored in the block header. If the verification passes, the user can be sure that the data object o6 indeed corresponds to ID 6.
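A small sketch of the ID assignment and the radix-based lookup described above (fanout 3, three levels, matching the o6 example); the function names are hypothetical.

```python
from typing import List

def assign_id(counter: int, max_id: int) -> int:
    """IDs are assigned by incrementing a counter modulo MaxID."""
    return counter % max_id

def path_from_id(obj_id: int, fanout: int, height: int) -> List[int]:
    """Interpret the ID in base `fanout`: each digit is the child index at one level."""
    digits = []
    for _ in range(height):
        digits.append(obj_id % fanout)
        obj_id //= fanout
    return list(reversed(digits))

# ID 6 with fanout 3 in a 3-level tree -> digits 0, 2, 0 ("020"),
# i.e. the 1st, 3rd, and 1st child at successive levels (1-indexed).
print(path_from_id(assign_id(6, max_id=3 ** 3), fanout=3, height=3))   # [0, 2, 0]
```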

B. Maintenance of the SWA-Trie

Recall that in our design, each SWA-Trie is built over the data objects in the most recent k blocks. Figure 7 shows an example of the trie structure with a sliding window size of 4. For ease of illustration, we assume that each block contains a single data object. In this example, the trie Ti is built over the objects with IDs {id1, id2, id3, id4}. Each trie node n contains a hash digest (denoted by hn) so that the nodes form a Merkle tree. For the root node and each leaf node, we also store a set of object IDs (denoted by Sn) and the corresponding accumulative value of the set (denoted by accn). Let H(·) be the cryptographic hash function, || the string concatenation operation, and acc(·) the cryptographic set accumulator. We define the fields of each trie node as follows.

Definition 1 (SWA-Trie leaf node). The fields of a leaf node n are defined as:
• wn = the keyword field associated with n;
• Sn = the set of IDs of the objects covered by n;
• accn = acc(Sn);
• hn = H(H(wn)||accn).

Definition 2 (SWA-Trie non-leaf node). Denote the child nodes of a non-leaf node n as {c1, ..., cF}. The fields of n are defined as:
• wn = the keyword field associated with n;
• Sn = the set of IDs of the objects covered by n (if n is the root);
• accn = acc(Sn) (if n is the root);
• childHashn = H(hc1|| ··· ||hcF);
• hn = H(H(wn)||childHashn||accn) (if n is the root); hn = H(H(wn)||childHashn) (if n is a non-root node).
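The node digests of Definitions 1 and 2 can be sketched as follows, using SHA-256 for H(·) and a hash-based stand-in for acc(·); the keywords and ID sets loosely follow the Figure 8 example and are illustrative only.

```python
import hashlib
from typing import FrozenSet, List, Optional

def H(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def acc(ids: FrozenSet[int]) -> bytes:
    # Stand-in for the cryptographic set accumulator over object IDs.
    return H(",".join(map(str, sorted(ids))).encode())

def leaf_hash(w: str, ids: FrozenSet[int]) -> bytes:
    # Definition 1: h_n = H(H(w_n) || acc_n)
    return H(H(w.encode()) + acc(ids))

def internal_hash(w: str, child_hashes: List[bytes],
                  root_ids: Optional[FrozenSet[int]] = None) -> bytes:
    # Definition 2: childHash_n = H(h_c1 || ... || h_cF); the root also commits to acc(S_n).
    child_hash = H(b"".join(child_hashes))
    if root_ids is not None:
        return H(H(w.encode()) + child_hash + acc(root_ids))
    return H(H(w.encode()) + child_hash)

h_7a = leaf_hash("7a", frozenset({3, 4}))
h_9b = leaf_hash("9b", frozenset({2, 3}))
h_5e = internal_hash("5e", [h_7a, h_9b])
root = internal_hash("*", [h_5e], frozenset({2, 3, 4, 5}))
print(root.hex())   # the digest embedded in the block header
```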


To incrementally update the SWA-Trie index, we maintain it as a persistent data structure. Algorithm 1 describes the maintenance algorithm. Upon receiving a new block of data objects, the algorithm deletes the object IDs of the k-th oldest block (denoted by bi−k+1) and inserts the object IDs of the new block (denoted by bi+1). In the example shown in Figure 7, to construct the SWA-Trie Ti+1 of bi+1, the algorithm deletes o1 from Ti and then inserts o5 into Ti+1. After that, the new nodes {n8, n9, n10, n11} are computed in a bottom-up manner. It is worth noting that we do not need to recompute the accumulative value of the new root n8 from scratch; instead, we can call ACC.Update to incrementally update the accumulative value based on the changed object IDs.
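A simplified sketch of the sliding-window bookkeeping behind this maintenance step: evict the object IDs of the oldest block and insert those of the new block. It keeps only a keyword-to-ID map, omitting the trie structure, persistence, and the ACC.Update call; the class and field names are ours.

```python
from collections import deque
from typing import Deque, Dict, List, Set, Tuple

Block = List[Tuple[int, Set[str]]]           # list of (object ID, keyword set)

class SWAIndex:
    """Keyword -> object-ID postings over the most recent k blocks."""
    def __init__(self, k: int):
        self.k = k
        self.window: Deque[Block] = deque()   # blocks currently covered by the index
        self.postings: Dict[str, Set[int]] = {}

    def append_block(self, block: Block) -> None:
        if len(self.window) == self.k:        # evict the k-th oldest block b_{i-k+1}
            for oid, kws in self.window.popleft():
                for w in kws:
                    self.postings[w].discard(oid)
        self.window.append(block)             # insert the new block b_{i+1}
        for oid, kws in block:
            for w in kws:
                self.postings.setdefault(w, set()).add(oid)

idx = SWAIndex(k=4)
for i, kw in enumerate(["A", "B", "A", "C", "D"], start=1):
    idx.append_block([(i, {kw})])
print(idx.postings["A"])                      # {3}: object 1 left the window when block 5 arrived
```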


C. Verifiable Query Processing

Given a Boolean query of the form Q = ⟨[ts, te], Υ⟩, the SP should return all data objects whose keywords satisfy the Boolean expression Υ within the time period, namely {oi = ⟨ti, Wi⟩ | ti ∈ [ts, te] ∧ Υ(Wi) = 1}. Our query processing algorithm consists of three steps. First, the query is divided into a set of subqueries, each with a time window of length k. Each subquery is then processed using the SWA index and the ObjReg index. Finally, the results of all subqueries are combined to form the final result. The whole query processing procedure is given in Algorithm 2.

1) Query division: Given a query Q, if the length of its time window is no less than k, we divide Q into multiple subqueries of length k. If the query window cannot be divided evenly, we let the time window of the last subquery overlap with that of the previous subquery. For example, assuming k = 4, a query with time window [t1, t10] is divided into subqueries with time windows [t1, t4] and [t5, t8], plus a last subquery with time window [t7, t10]. Note that this may produce redundant results, but it does not affect the correctness of query processing. On the other hand, if the query time window is shorter than k, Q is treated as a special subquery, which is discussed in Section IV-C3.
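A small sketch of this division rule, reproducing the [t1, t10], k = 4 example; the function name is ours.

```python
from typing import List, Tuple

def divide_query(ts: int, te: int, k: int) -> List[Tuple[int, int]]:
    """Split [ts, te] into windows of length k; the last window may overlap its predecessor."""
    if te - ts + 1 < k:
        return [(ts, te)]                     # shorter than k: handled as a special subquery
    windows = []
    start = ts
    while start + k - 1 <= te:
        windows.append((start, start + k - 1))
        start += k
    if windows[-1][1] < te:                   # residue: overlap the last window with the previous one
        windows.append((te - k + 1, te))
    return windows

print(divide_query(1, 10, 4))   # [(1, 4), (5, 8), (7, 10)]
```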
2) Subquery processing: For each sub-query q = ⟨[ts′, te′], Υ⟩ with a time window of length k, we first traverse the SWA-Trie located in block be′ to obtain, for each keyword w in Υ, the intermediate result Rw and the corresponding Merkle proof πw. To reduce the proof size, the πw's are merged into πtrie. Then, verifiable set operations based on Υ are performed on the intermediate results, yielding the result ID set RΥ and the set-operation proof πΥ. Finally, the SP queries the ObjReg index located in be′ to retrieve the corresponding data objects together with the Merkle proof πobj.
More specifically, for each keyword w in Υ, the SP searches the SWA-Trie to find all objects containing w (summarized in Algorithm 3). The SP starts from the root and traverses the SWA-Trie in a top-down manner. If the key fragment of a trie node n does not match w, none of the data objects under this node belongs to Rw. In this case, if n is a leaf node, the SP adds wn and accn to πtrie as part of the Merkle proof; otherwise, it adds wn and childHashn (and accn if n is the root) to πtrie. For each node n whose key fragment matches w, if n is a leaf, the SP adds Sn to Rw and appends wn and accn to πtrie; otherwise, the subtree is explored further and wn (and accn if n is the root) is appended to πtrie. Note that the Merkle proofs of different keywords can share common paths, so the proofs of all keywords in Υ can be merged to reduce the proof size. A simplified sketch of this traversal is given below.
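The sketch below illustrates the top-down traversal of Algorithm 3. The node layout follows Definitions 1 and 2, but the key-matching rule and the proof encoding are simplified placeholders rather than the paper's exact ones.

```rust
struct Node {
    w: String,           // key fragment wn ("*" for the root in this toy example)
    children: Vec<Node>, // empty for leaves
    ids: Vec<u64>,       // Sn (meaningful for leaves and the root)
    acc: u64,            // accn (placeholder digest; leaves and root)
    child_hash: u64,     // childHashn (internal nodes)
    is_root: bool,
}

#[derive(Debug)]
enum ProofEntry {
    LeafMiss { w: String, acc: u64 },
    InnerMiss { w: String, child_hash: u64 },
    OnPath { w: String, acc: Option<u64> }, // acc is carried for leaves and the root
}

// Simplified match test: does the keyword still follow this node's key-fragment path?
fn matches(n: &Node, prefix: &str, keyword: &str) -> bool {
    if n.is_root {
        return true; // the root's wildcard fragment matches every keyword
    }
    let path = format!("{}{}", prefix, n.w);
    keyword.starts_with(&path) || path.starts_with(keyword)
}

fn search(n: &Node, prefix: &str, keyword: &str, result: &mut Vec<u64>, proof: &mut Vec<ProofEntry>) {
    if !matches(n, prefix, keyword) {
        // Nothing under n matches; record just enough of n to rebuild its hash.
        if n.children.is_empty() {
            proof.push(ProofEntry::LeafMiss { w: n.w.clone(), acc: n.acc });
        } else {
            proof.push(ProofEntry::InnerMiss { w: n.w.clone(), child_hash: n.child_hash });
        }
        return;
    }
    if n.children.is_empty() {
        // Matching leaf: its ID set joins the result, and (wn, accn) joins the proof.
        result.extend_from_slice(&n.ids);
        proof.push(ProofEntry::OnPath { w: n.w.clone(), acc: Some(n.acc) });
    } else {
        // Matching internal node: record it and explore the subtree further.
        proof.push(ProofEntry::OnPath { w: n.w.clone(), acc: n.is_root.then_some(n.acc) });
        let path = if n.is_root { prefix.to_string() } else { format!("{}{}", prefix, n.w) };
        for c in &n.children {
            search(c, &path, keyword, result, proof);
        }
    }
}

fn main() {
    // A toy trie loosely modelled on Figure 8.
    let leaf_7a = Node { w: "7a".into(), children: vec![], ids: vec![3, 4], acc: 6, child_hash: 0, is_root: false };
    let leaf_9b = Node { w: "9b".into(), children: vec![], ids: vec![2, 3], acc: 7, child_hash: 0, is_root: false };
    let n_5e = Node { w: "5e".into(), children: vec![leaf_7a, leaf_9b], ids: vec![], acc: 0, child_hash: 3, is_root: false };
    let other = Node { w: "ff".into(), children: vec![], ids: vec![5], acc: 5, child_hash: 0, is_root: false };
    let n_9a = Node { w: "9a".into(), children: vec![other], ids: vec![], acc: 0, child_hash: 9, is_root: false };
    let root = Node { w: "*".into(), children: vec![n_5e, n_9a], ids: vec![2, 3, 4, 5], acc: 8, child_hash: 0, is_root: true };

    let (mut result, mut proof) = (Vec::new(), Vec::new());
    search(&root, "", "5e7a", &mut result, &mut proof);
    assert_eq!(result, vec![3u64, 4]); // R_{5e7a} = {id3, id4}, as in the Figure 8 example
    println!("proof entries: {:?}", proof);
}
```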

Example. In the example of Figure 8, consider a sub-query with time window [ti−2, ti+1] and two keywords 5e7a and 5e9b. We search the trie Ti+1 located in bi+1 and obtain the results R5e7a = S6 = {id3, id4}, R5e9b = S7 = {id2, id3}, and the Merkle proof πtrie = {⟨∗, acc8⟩, ⟨5e⟩, ⟨9a, childHash9⟩, ⟨7a, acc6⟩, ⟨9b, acc7⟩}.

After the SP obtains the intermediate results of the trie search, it uses the set accumulator to perform verifiable set operations according to Υ. To support arbitrary Boolean queries, including the ¬ (NOT), ∧ (AND), and ∨ (OR) operators, we use the accumulator proposed in [9]. Specifically, the ¬, ∧, and ∨ operators of the query's Boolean expression are mapped to the set difference (\), set intersection (∩), and set union (∪) operations of the set accumulator scheme.

Example. In the running example of Figure 8, for the Boolean expression Υ1 = 5e7a ∧ 5e9b, the SP obtains the result RΥ1 = R5e7a ∩ R5e9b = {id3} and the set-operation proof πΥ1 by calling ACC.Prove(R5e7a, R5e9b, ∩, pk). Similarly, for Υ2 = 5e7a ∨ 5e9b, the SP obtains RΥ2 = R5e7a ∪ R5e9b = {id2, id3, id4} and πΥ2 by calling ACC.Prove(R5e7a, R5e9b, ∪, pk). For Υ3 = ¬5e9b, the SP first retrieves all object IDs in Ti+1, namely R∗ = {id2, id3, id4, id5}, whose Merkle proof is π∗ = {⟨∗, childHash8⟩}. A verifiable set difference then yields RΥ3 = R∗ \ R5e9b = {id4, id5} and the proof πΥ3 by calling ACC.Prove(R∗, R5e9b, \, pk). For Υ4 = 5e7a ∧ (¬5e9b), a verifiable set difference R5e7a \ R5e9b can be performed directly. For Υ5 = 5e7a ∨ (¬5e9b), nested verifiable set operations are performed: the SP first calls ACC.Prove on R∗ \ R5e9b to obtain R¬5e9b, and then computes the verifiable set union R5e7a ∪ R¬5e9b to obtain RΥ5 and its set-operation proof πΥ5. A toy sketch of this mapping is shown below.
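The following toy sketch shows the operator-to-set-operation mapping on the Figure 8 sets. acc_prove is a hypothetical stand-in for ACC.Prove that returns only the result set and omits the cryptographic proof.

```rust
use std::collections::BTreeSet;

type IdSet = BTreeSet<u64>;

enum SetOp { Intersection, Union, Difference }

// Hypothetical stand-in for ACC.Prove(S1, S2, op, pk): result set only, no proof.
fn acc_prove(s1: &IdSet, s2: &IdSet, op: SetOp) -> IdSet {
    match op {
        SetOp::Intersection => s1.intersection(s2).copied().collect(),
        SetOp::Union => s1.union(s2).copied().collect(),
        SetOp::Difference => s1.difference(s2).copied().collect(),
    }
}

fn main() {
    // Intermediate results from the Figure 8 example.
    let r_5e7a: IdSet = [3u64, 4].into_iter().collect();
    let r_5e9b: IdSet = [2u64, 3].into_iter().collect();
    let r_all: IdSet = [2u64, 3, 4, 5].into_iter().collect(); // all IDs in the indexed window

    // Υ1 = 5e7a ∧ 5e9b  ->  set intersection.
    let r1 = acc_prove(&r_5e7a, &r_5e9b, SetOp::Intersection);
    assert_eq!(r1, [3u64].into_iter().collect::<IdSet>());

    // Υ3 = ¬5e9b  ->  set difference against the full window set R*.
    let r3 = acc_prove(&r_all, &r_5e9b, SetOp::Difference);
    assert_eq!(r3, [4u64, 5].into_iter().collect::<IdSet>());

    // Υ5 = 5e7a ∨ (¬5e9b)  ->  nested: difference first, then union.
    let r5 = acc_prove(&r_5e7a, &r3, SetOp::Union);
    assert_eq!(r5, [3u64, 4, 5].into_iter().collect::<IdSet>());
}
```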
Next, the SP queries the ObjReg index located in be′ and retrieves the corresponding data objects according to the result IDs, computing the Merkle proof πobj for the retrieved objects. Note that the last sub-query may share some result objects with its preceding sub-query; in that case, the SP does not retrieve again the objects already obtained in the previous sub-query. Finally, the SP packs πtrie, RΥ, πΥ, and πobj together as the VO of the sub-query.

3) Result merging: After the SP obtains the results of all sub-queries, it merges them into the final result of the original query.

Note that in the special case where the length of the query time window [ts, te] is less than k, the query Q is processed as follows. The SP first visits block be and obtains the result set RΥ = {oi = ⟨ti, Wi⟩ | ti ∈ [te−k+1, te] ∧ Υ(Wi) = 1} together with its proof. Next, the SP locates block bs−1, whose SWA-Trie root node is used to retrieve the ID set Sns−1 of all objects in the sliding window [ts−k, ts−1], along with its accumulator value accns−1. The SP then invokes a verifiable set difference ACC.Prove(RΥ, Sns−1, \, pk) to compute the final result set (a toy illustration of the window arithmetic follows below).
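As a sanity check of the window arithmetic in this special case, the toy example below uses plain timestamp sets (no cryptography, one object per timestamp assumed) to show that the set difference leaves exactly the objects inside [ts, te].

```rust
use std::collections::BTreeSet;

fn main() {
    let k = 4i64;
    let (ts, te) = (7i64, 9i64); // query window [t7, t9] is shorter than k

    // Pretend there is exactly one matching object per timestamp, identified by its timestamp.
    let r_upsilon: BTreeSet<i64> = (te - k + 1..=te).collect(); // result over [t6, t9] from block b_e
    let s_prev: BTreeSet<i64> = (ts - k..=ts - 1).collect();    // IDs of window [t3, t6] at block b_{s-1}

    // ACC.Prove(RΥ, S_{ns-1}, \, pk) — here just a plain set difference, no proof.
    let result: BTreeSet<i64> = r_upsilon.difference(&s_prev).copied().collect();
    let expected: BTreeSet<i64> = (ts..=te).collect();
    assert_eq!(result, expected); // exactly the matching objects with timestamps in [t7, t9]
}
```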

D. Query result verification

On the user side, the integrity of the query results can be verified through the following steps. First, the user extracts the proofs from the VO ⟨πtrie, RΥ, πΥ, πobj⟩. The user then verifies the integrity of the keyword search on the SWA-Trie index and of the object retrieval on the ObjReg index by reconstructing the root hashes from πtrie and πobj, respectively. If they match the ones stored in the block header, the soundness and completeness of these searches are established. Afterwards, the user runs ACC.Verify with πΥ to check the integrity of the set operations of the Boolean expression Υ. The complete verification procedure is given in Algorithm 4; a high-level sketch follows the example below.

Example. In the running example of Figure 8, upon receiving the query result and the VO, the user first reconstructs the trie root hash h′8 using πtrie as follows: h′6 = H(H(7a)||acc6), h′7 = H(H(9b)||acc7), h′3 = H(H(5e)||H(h′6||h′7)), h′9 = H(H(9a)||childHash9), and h′8 = H(H(∗)||H(h′3||h′9)||acc8). If h′8 equals the h8 retrieved from the block header, the integrity of the keyword search is verified. Next, the user verifies the object results RΥ using πobj. Finally, the user verifies the integrity of the set operations by calling ACC.Verify (for example, ACC.Verify(acc6, acc7, ∩, πΥ, acc(RΥ), pk)).
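A high-level sketch of this verification flow is given below. All checks are stubs that stand in for the real Merkle-root reconstruction and ACC.Verify calls; the sketch only shows how the three checks fit together.

```rust
struct Vo {
    pi_trie: Vec<u8>,    // merged Merkle proof of the SWA-Trie keyword searches
    r_upsilon: Vec<u64>, // result ID set RΥ of the Boolean expression
    pi_upsilon: Vec<u8>, // set-operation proof produced by ACC.Prove
    pi_obj: Vec<u8>,     // Merkle proof of the ObjReg object retrievals
}
struct BlockHeader { trie_root: u64, objreg_root: u64 }

// Stub: rebuild the SWA-Trie root hash from πtrie (e.g. h′8 in the example above).
fn reconstruct_trie_root(_pi_trie: &[u8]) -> u64 { 0 }
// Stub: rebuild the ObjReg root hash from πobj.
fn reconstruct_objreg_root(_pi_obj: &[u8]) -> u64 { 0 }
// Stub: ACC.Verify over the set-operation proof and the claimed result set.
fn acc_verify(_pi_upsilon: &[u8], _result: &[u64]) -> bool { true }

fn verify(vo: &Vo, header: &BlockHeader) -> bool {
    // 1) The keyword search is checked by comparing the rebuilt trie root with the header.
    let trie_ok = reconstruct_trie_root(&vo.pi_trie) == header.trie_root;
    // 2) The object retrieval is checked the same way against the ObjReg root.
    let obj_ok = reconstruct_objreg_root(&vo.pi_obj) == header.objreg_root;
    // 3) The set operations of Υ are checked with ACC.Verify.
    let set_ok = acc_verify(&vo.pi_upsilon, &vo.r_upsilon);
    trie_ok && obj_ok && set_ok
}

fn main() {
    let vo = Vo { pi_trie: vec![], r_upsilon: vec![3], pi_upsilon: vec![], pi_obj: vec![] };
    let header = BlockHeader { trie_root: 0, objreg_root: 0 };
    assert!(verify(&vo, &header)); // trivially accepts because the checks are stubs
}
```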

V. Extension to other query types

In this section, we discuss how the proposed method can be extended to support other query types, such as range queries and Boolean range queries.

(The preceding sections described the processing of keyword queries.)

One-dimensional range queries. Given a range query of the form Q = ⟨[ts, te], [α, β]⟩, the SP should return all data objects whose values fall in the range [α, β], i.e., {oi = ⟨ti, vi⟩ | ti ∈ [ts, te] ∧ vi ∈ [α, β]}. We can follow a similar sliding-window design for query processing. Miners construct a SWA-B+-tree to index the values of the data objects. Figure 9 shows an example of such a SWA-B+-tree with an indexed sliding window of size 4. Each tree node n contains the following fields: a hash digest (denoted by hn), a value or value range (denoted by vn or [ln, un]), a set of object IDs (denoted by Sn), and the corresponding accumulator value of the set (denoted by accn). These fields are defined analogously to the SWA-Trie nodes (the formal definition is omitted in these notes).

VI. Optimization

We observe that the bottleneck of query processing lies in the verifiable set operations, whose overhead is determined by the size of the input sets. In this section, three optimization techniques are proposed to improve query performance.

VII. Experimental evaluation

We conduct the experiments on a machine running CentOS 8 with dual Intel Xeon E5-2620 v3 2.4 GHz CPUs. Query users are restricted to 4 threads during verification, while miners and the SP use all available CPU cores. The vChain+ system is implemented in the Rust programming language with the following dependencies: Arkworks for the bilinear pairing on the BN254 curve used by the set accumulator, Blake2b for 256-bit hashing, and Rayon for parallel computation. The source code is available at https://github.com/hkbudb/vchain-plus. The same language and dependencies are also used to implement vChain [4] as the baseline, including its two proposed accumulator constructions, labeled vChain-acc1 and vChain-acc2.

Datasets

• Foursquare (4SQ) [5]: The 4SQ dataset contains 1 million user check-in records with timestamps. We pack the records of every 30 seconds into one block, and each record is of the form ⟨timestamp, [longitude, latitude], {keywords of the check-in place}⟩.
• Ethereum (ETH) [2]: The ETH dataset is extracted from the Ethereum blockchain between December 17, 2018 and December 26, 2018. It contains about 58,100 blocks and about 3.27 million transaction records, with a block interval of roughly 15 seconds. Each record is of the form ⟨timestamp, [amount], {addresses}⟩, where amount is the transfer amount and {addresses} are the addresses of the sender and the receiver.

Performance metrics:
(1) CPU time per query, i.e., the SP's query processing time plus the user's result verification time;
(2) the size of the VO (verification object) transmitted from the SP to the user.

The query workloads include intersection (∧) and union (∨) Boolean keyword queries on the two datasets, range-only queries, and Boolean range queries that combine intersection or union conditions with a range predicate. For each setting, 10 queries are generated at random and the average CPU query time and VO size are reported.
A. ADS (authenticated data structure) construction cost

Table 2 shows the ADS construction cost on the miner side, including the ADS construction time and the ADS size. In vChain, the maximum size of the inter-block index is set to 32. For vChain+, the sliding window sizes are set to {2, 4, 8, 16, 32}, and the fanout of the SWA-B+-tree and the ObjReg index is set to 4. All optimizations presented in Section VI are enabled for vChain+. As Table 2 shows, the index construction time of vChain+ is longer than that of vChain-acc2 but shorter than that of vChain-acc1. Also, vChain+ produces larger ADSs than vChain. This is expected, since the set accumulator used in vChain+ has larger digests than that of vChain in order to support more expressive set operations. Moreover, as discussed in Section VI-A, the multiple-sliding-window design introduces multiple SWA (sliding window accumulator) indexes, which also increases the ADS size in each block. On the client side, the block header size is fixed at 104 bytes for both vChain and vChain+.
B. Query performance
Figures 10 to 17 compare the query performance of vChain and vChain+ as the query time window varies from 100 blocks to 8,100 blocks. Five types of query conditions are examined: ∨-connected and ∧-connected Boolean keyword queries, range-only queries, and ∨-connected and ∧-connected Boolean range queries. Thanks to the tree-based index search and the accumulator-based sliding-window design, vChain+ handles all of these query types efficiently; overall, it improves query performance by 913× over vChain-acc2 and 1098× over vChain-acc1. Note that vChain+ has a larger VO size than vChain in most cases, because the set-operation proofs generated by vChain+ are larger than those of vChain. However, considering the global average mobile network speed of 29.06 Mbps [12], the total time for VO transmission plus query processing of vChain+ is still better than that of vChain. For example, as shown in Figure 16, when the query time window is 8,100 blocks, the VO size and query time of vChain+ are 623 KB and 0.05 s, while those of vChain-acc2 are 68 KB and 1.59 s. At this mobile network speed, the total time for VO transmission and query processing of vChain+ is 0.221 s, which still outperforms vChain by 7.3×. When processing ∨-connected Boolean range queries on ETH, we observe that vChain-acc2 performs slightly better than vChain+ when the time window is 100 blocks (Figure 17). This is because the generation of set-operation proofs dominates the query time in vChain+: a ∨-connected Boolean condition involves a union operation, which enlarges the input sets of ACC.Prove and makes the cryptographic operations more expensive.
C. Effects of optimization and selectivity
We now evaluate the impact of the three optimization techniques on query performance and VO size. We test ∧-connected Boolean range queries on the ETH dataset. All optimizations are enabled as the baseline (denoted all), and each optimization is then disabled in turn to investigate its impact. Specifically, we run experiments (i) without multiple sliding windows (no multi-win), (ii) without the optimized query plan (no qp), and (iii) without empty-set pruning (no pruning). Figure 18 shows the query performance of the different settings as the query time window varies from 100 blocks to 8,100 blocks. Empty-set pruning and the optimized query plan are effective for most queries and bring the largest performance improvements.

Origin blog.csdn.net/weixin_41523437/article/details/124567223