MongoDB best practices (reprint)

Foreword

As a solutions architect at MongoDB, most of my time is spent interacting with customers and users. Here I would like to collect and maintain, in a continuously updated living article, some best practices that MongoDB developers should know or follow. I sincerely hope you will get involved and help maintain this document together, so that more users can benefit (contact me via WeChat, given at the end of the article).

This article covers the following areas:

- Security measures
- Deployment architecture
- System optimization
- Monitoring and backup
- Indexes
- Development and schema design

If you are about to bring MongoDB-based core business applications online, or already have, please contact the MongoDB Greater China team ([email protected]) to obtain professional advice from the official consulting team.

About Security

Enable authentication on MongoDB clusters

Authentication is not enabled in a default MongoDB installation. This means that anyone can connect directly to the mongod instance and perform arbitrary database operations. It is recommended to enable authentication as described in the documentation: http://docs.mongoing.com/manual-zh/tutorial/enable-authentication.html

Assign different roles and permissions to different users

MongoDB supports defining permissions by role. Following the principle of least privilege, you should explicitly grant users only the permissions they need.
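
For example, a minimal sketch in the mongo shell (user names, passwords, and database names here are hypothetical): create a read-only reporting user and a read-write application user, each granted only the role it needs.

    use admin
    db.createUser({
      user: "reportUser",
      pwd: "aStrongPassword",
      roles: [ { role: "read", db: "orders" } ]        // read-only access to the orders database
    })
    db.createUser({
      user: "appUser",
      pwd: "anotherStrongPassword",
      roles: [ { role: "readWrite", db: "orders" } ]   // no admin or cluster-level privileges
    })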

Use a central authentication server

Use a central authentication server such as LDAP or Kerberos, and enforce a strong password policy.

Create a whitelist of application servers that need to access MongoDB (firewall configuration)

If your server has multiple network interfaces, it is recommended that the service listen only on the internal-network IP.

Use the encryption engine for sensitive data

MongoDB Enterprise Edition supports storage encryption; data involving sensitive customer information should be protected with the encryption engine.

About Deployment

Use a replica set with at least three data nodes

The recommended minimum MongoDB deployment is a replica set consisting of three data nodes. A replica set provides the following advantages:
- 99.999% system availability
- Automatic failover
- Data redundancy
- Disaster recovery deployment
- Read/write separation

Do not shard too early

Sharding can be used to scale your system's read and write capacity, but it brings many new challenges, such as increased management complexity and cost, and the difficulty of choosing an appropriate shard key. In general, you should exhaust other performance-tuning options first, such as index optimization, schema optimization, code optimization, hardware upgrades, and IO optimization, before starting to consider sharding.

Select an appropriate number of shards

Some conditions that trigger sharding:

  • The amount of data is too large to manage on a single server
  • Concurrency is too high for a single server to handle
  • Disk IO pressure is too great
  • A single machine does not have enough memory to hold the hot data
  • The server's network interface (NIC) throughput becomes a bottleneck
  • You want to support localized reads and writes in a geographically distributed deployment

Depending on which condition triggers sharding, you can determine the number of shards required by dividing the total demand by the capacity each server can provide. For example, if you need to hold roughly 8TB of data and each server can reasonably manage 2TB, you would need about 4 shards.

Deploy enough replica set members for each shard

Data is not replicated between shards. Each shard's data must be highly available within that shard. Therefore MongoDB requires each shard to be deployed with at least three data nodes, to ensure the shard does not become unavailable just because its primary node goes down.

Select an appropriate shard key

In a sharded deployment, the most important consideration is selecting an appropriate shard key. Choosing a shard key requires understanding the application's read and write patterns. In general, a shard key optimizes either writes or reads; make the appropriate trade-off depending on which operation is more frequent. A brief sketch follows the list below.

  • The shard key should have very high cardinality; in other words, the field should have many distinct values in the collection. For example, _id is a high-cardinality shard key because _id values are never repeated
  • In general, the shard key should not be monotonically increasing; a timestamp, for example, is a continuously growing shard key. Such keys easily cause a hot shard, where all new writes concentrate on a single shard
  • A good shard key should allow a query to be routed to one (or a few) shards, which improves query efficiency. In general this means the shard key should include the fields used by the most common queries
  • A good shard key should also have enough dispersion, so that newly inserted documents are spread across multiple shards, increasing concurrent write throughput
  • A compound shard key made of several fields can be used to achieve several of these goals at once (cardinality, dispersion, query targeting, etc.)
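
As a minimal sketch in the mongo shell (database, collection, and field names are hypothetical), a compound shard key can combine a commonly queried field with a higher-cardinality field:

    sh.enableSharding("appdb")                                            // enable sharding for the database
    sh.shardCollection("appdb.orders", { customer_id: 1, order_id: 1 })   // compound shard key

Here customer_id targets the most common queries, while order_id adds cardinality and dispersion.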

About the System

Use SSDs or RAID10 to improve IOPS capacity

MongoDB is a high-performance, highly concurrent database, and most of its IO operations are random updates. A machine with local SSDs is generally the best storage option. If you use ordinary hard drives, we recommend RAID10 striping to increase the number of concurrent IO channels.

Use separate physical volumes for data and journal/logs

Many MongoDB bottlenecks are related to IO performance. It is recommended to put the logs (journal and system logs) on a separate physical volume to reduce the IO load on the data disk.

The system log path can be specified directly on the command line or with a configuration parameter. The journal does not support being placed directly in another directory, but this can be worked around by creating a symbolic link for the journal directory.

Use the XFS file system

For the WiredTiger storage engine, MongoDB recommends the XFS file system. Ext4 is the most common choice, but the ext file system's internal journaling conflicts with WiredTiger's, so performance is poor under heavy IO pressure.

Use very large caches with caution under WiredTiger

WiredTiger writes data to disk asynchronously, performing a checkpoint every 60 seconds by default. A checkpoint has to traverse all the dirty data in memory in order to organize it and write it to disk. If the cache is very large (for example, larger than 128GB), the checkpoint takes longer, and write performance suffers while it runs. The current recommendation is to keep the actual cache size at 64GB or less.

Disable Transparent Huge Pages

Transparent Huge Pages (THP) is a Linux memory-management optimization that reduces Translation Lookaside Buffer (TLB) overhead by using larger memory pages. Most MongoDB workloads consist of scattered small reads and writes, for which THP has a negative impact, so it is recommended to disable it.

http://docs.mongoing.com/manual-zh/tutorial/transparent-huge-pages.html

Enable Log Rotation

Prevent MongoDB's log file from growing indefinitely and taking up too much disk space. Good practice is to enable log rotation and clean up historical log files promptly.

http://docs.mongoing.com/manual-zh/tutorial/rotate-log-files.html

Allocate enough oplog space

Sufficient oplog space ensures you have enough time to resynchronize a secondary from scratch, or to perform time-consuming maintenance operations on a secondary. If your longest offline maintenance operation takes H hours, your oplog should generally be able to hold at least 2×H to 3×H hours of operations.
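
One way to check the current oplog window from the mongo shell:

    rs.printReplicationInfo()
    // "log length start to end" in the output is the current oplog window;
    // make sure it comfortably exceeds 2-3x your longest maintenance window.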

If the oplog size was not set correctly when MongoDB was deployed, you can adjust it by following this link:

http://docs.mongoing.com/manual-zh/tutorial/change-oplog-size.html

Disable atime on the database file system

Disabling access-time updates on the file system improves file read performance. This can be done by adding the noatime option in the /etc/fstab file, for example:

/dev/xvdb /data ext4 noatime 0 0

After modifying the file, remount the file system:

# mount -o remount /data

Increase the default file descriptor and process/thread limits

The default Linux limits on the number of file descriptors and the maximum number of processes are generally too low for MongoDB. It is recommended to set both values to 64000, because the MongoDB server needs a file descriptor for every database file and for every client connection. If these limits are too low, errors or unresponsiveness may occur under large-scale concurrent operations. You can modify the values with the following commands:

ulimit -n 64000
ulimit -u 64000

Disable NUMA

On multi-processor Linux systems that use NUMA, you should disable NUMA for MongoDB. MongoDB can sometimes run slowly in a NUMA environment, especially when the process is under high load.

Set an appropriate readahead value

Readahead is an operating-system file optimization: when a program requests a page, the file system reads the following pages at the same time and returns them together. The reason is that, very often, the most time-consuming part of disk IO is the seek; with readahead, the system can return the upcoming data immediately. If the program is doing sequential reads, this saves a lot of disk seek time.

MongoDB, however, mostly does random access. For random access, the readahead value should be set as small as possible; 32 (512-byte sectors, i.e. 16KB) is generally a good choice.

You can display the system's current readahead values with the following command:

sudo blockdev --report

To change the readahead value, use the following command:

sudo blockdev --setra 32 <device>

Replace <device> with the appropriate storage device.

Use an NTP time server

When using MongoDB replica sets or sharded clusters, be sure to use an NTP time server, so that the clocks of all members of the MongoDB cluster stay properly synchronized.

About Monitoring and Backup

Monitor and alert on the database's key metrics

Key indicators include:

- Disk space
- CPU
- RAM utilization
- Ops counters (inserts, queries, updates, deletes)
- Replication lag
- Connections
- Oplog window

Monitor the slow query log

By default, MongoDB records database operations that take longer than 100ms in its log file (mongod.log).
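
A small sketch in the mongo shell (100ms is just the default threshold): raise the profiler level so slow operations are also recorded in the system.profile collection, where they are easier to query than in the log file.

    db.setProfilingLevel(1, 100)                                   // level 1: profile operations slower than 100ms
    db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()    // the most recent slow operations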

About Indexes

Create an appropriate index for each of your queries

This matters for collections with large amounts of data, for example tens of millions of documents or more. Without a suitable index, MongoDB has to read every document from disk into memory, which puts heavy pressure on the MongoDB server and affects the execution of other requests.

Create appropriate compound indexes; do not rely on index intersection

If your queries use multiple fields, MongoDB offers two indexing techniques: compound indexes and index intersection. Index intersection creates a separate single-field index for each field and, at query execution time, combines the results obtained from the individual single-field indexes. The optimizer currently chooses index intersection only rarely, so if your queries involve more than one field it is recommended to use compound indexes to ensure indexes are used as expected.

For example, if an application needs to find all marathon runners in Shenzhen younger than 30:

db.athelets.find({sport: "marathon", location: "sz", age: {$lt: 30}})

Then you might need an index like this:

db.athelets.ensureIndex({sport:1, location:1, age:1});

Compound index field order: equality conditions first, range conditions after (Equality First, Range After)

Using the example above: when creating the compound index, if the conditions include both equality matches and ranges, the equality-match fields (e.g. sport: "marathon") should come at the front of the compound index, and the range fields (age: <30) should come at the back.

Whenever possible, use a covering index (Covered Index)

Sometimes a query only needs to return very few fields, or even just one. For example, you want to find the destinations of all flights departing from Hongqiao Airport. The existing index is:

{origin: 1, dest: 1}

The normal query would look like this (only the destination airport needs to be returned):

db.flights.find({origin:"hongqiao"}, {dest:1});

Such a query includes the _id field by default, so MongoDB has to scan the matching documents to retrieve the results. Instead, if you use this query:

db.flights.find({origin:"hongqiao"}, {_id:0, dest:1});

then MongoDB can get all the values it needs directly from the index, without scanning the actual documents (which might otherwise have to be loaded from disk into memory).

Build indexes in the background

While an index is being created on a collection, the database that houses the collection will not accept other read or write operations. For collections with a significant amount of data, it is recommended to build indexes with the background option {background: true}, for example:
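
A minimal sketch (collection and field names are hypothetical):

    db.orders.ensureIndex({ customer_id: 1 }, { background: true })   // build without blocking reads and writes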

About Development

Schema Design

Do not design your schema like relational tables

MongoDB lets you design a table-like structure just as in a relational database, but it does not support foreign keys, nor does it support complex joins! If your application turns out to rely on a large number of JOINs, your design may need to start over. Refer to the schema design recommendations below.

Do not use too many collections in a database

MongoDB schema design is based on a flexible, rich JSON document model. In many cases, the number of collections (tables) in a MongoDB application should be far lower than in the same type of application built on a relational database. MongoDB schema design does not follow the third normal form; its data model is very close to the object model, so collections are basically created to correspond to the main domain objects. As a rule of thumb, a small application usually has only a handful of collections, while a medium or large application may have a dozen or so, or at most a few dozen.

Do not be afraid of data redundancy

MongoDB schema design does not follow the third normal form; it allows data to be repeated in multiple documents. For example, repeating the department name in every employee document is an acceptable practice. If the department name changes, you can run an update({...}, {...}, {multi: true}) multi-document update to change the department name everywhere in one pass.
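
A minimal sketch of such a multi-document update (collection and field names are hypothetical):

    db.employees.update(
      { "department.name": "Sales" },                    // every document holding the old name
      { $set: { "department.name": "Global Sales" } },
      { multi: true }                                    // apply to all matching documents
    )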

Data suitable and unsuitable for redundancy

In general, if the value of a field changes frequently, it is not suitable for heavy redundancy into other documents or other collections. For example, in an equity asset management system many customers may hold Apple stock, but the stock price changes frequently; making it redundant in each customer's documents would lead to a large number of update operations. Conversely, fields that rarely change, such as a customer's name, address, or department, can safely be made redundant.

For 1:N (a few) relationships, embed everything

For one-to-few relationships, such as a person having several pieces of contact information, or a book having 10 chapters, it is recommended to use an embedded model and describe the N side as an array inside the document, for example:

    > db.person.findOne()
    {
      user_id: 'tjworks',
      name: 'TJ Tang',          
      contact : [
         { type: 'mobile', number: '1856783691' },
         { type: 'wechat', number: 'tjtang826'}
      ]
    }

For 1:NN (many) relationships, embed IDs

Sometimes the many side is quite large, for example the number of employees within a department. A third-level department at Huawei might have thousands of employees; embedding all employee information directly inside the department document is certainly not a good choice here, as it could exceed the 16MB document size limit. In this case you can embed reference IDs instead:

> db.departments.findOne()
{
    name : 'Enterprise BG',
    president: 'Zhang San',
    employees : [     // array of references to the Employee collection
        ObjectID('AAAA'),    
        ObjectID('F17C'),    
        ObjectID('D2AA'),
        // etc
    ]
}

If you need to query employee information for a department, you can use the $lookup aggregation operator to join in the employee documents and return them.
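
A hedged sketch of that aggregation (it assumes an employees collection whose _id values are stored in departments.employees, and MongoDB 3.4+, where $lookup can match each element of an array-valued localField; on older versions, $unwind the array first):

    db.departments.aggregate([
      { $match: { name: "Enterprise BG" } },
      { $lookup: {
          from: "employees",            // collection holding the referenced documents
          localField: "employees",      // array of ObjectIDs in the department document
          foreignField: "_id",
          as: "employee_docs"           // joined employee documents are returned here
      } }
    ])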

For 1:NNN (a great many) relationships, reference the parent from the child collection

When the many side grows frequently and without bound, for example a meter reading every minute adding up to hundreds of thousands of readings a year, even an array of IDs becomes inconvenient to manage. In this case you should create a separate collection for the many side and add to each of its documents a reference back to the main document, for example:

    > db.sensors.findOne()
    {
        _id : ObjectID('AAAB'),
        name : 'engine temperature',
        vin : '4GD93039GI239',
        engine_id: '20394802',
        manuafacture: 'First Motor',
        production_date: '2014-02-01'
        ...
    }

    >db.readings.findOne()
    {
        time : ISODate("2014-03-28T09:42:41.382Z"),
        sensor: ObjectID('AAAB'),
        reading: 67.4            
    }

Store large binary files and their metadata in separate collections

If you need to manage PDF files, pictures, videos, or even small binary files, it is recommended to use the MongoDB GridFS API, or to manually manage the binary data and its metadata in separate collections.

Do not put frequently updated data in nested arrays

Arrays are a powerful tool for expressing one-to-many relationships, but MongoDB lacks the ability to directly update an element inside a nested array. For example:

{
    name: "Annice",
    courses: [
        { name: "English", score: 97 },
        { name: "Math", score: 89 },
        { name: "Physics", score: 95 }
    ]
}

This design has no nested arrays, so we can directly change the Math score to 99:

db.students.update({name: "Annice", "courses.name":"Math"}, {$set:{"courses.$.score": 99 }})

Note the use of the array positional operator $, which refers to the index of the first array element matched by the current query.

However, consider the following case, which involves a nested array:

    {
        name: "Annice",
        courses: [
            { name: "Math", scores: [ 
                                {term: 1, score: 80} ,
                                {term: 2, score: 90}
                            ] 
             },
            { name: "Physics", score: 95 }
        ]
    }

Now, if you want to change the score for term 1 of the Math course, you have to load the scores array into memory and modify the nested element in application code, because MongoDB's $ positional operator is only valid for the first level of an array.
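
A hedged sketch of that workaround in the mongo shell (collection and field names follow the example above): read the document, find the positions of the nested element in application code, and write back by explicit position.

    var doc = db.students.findOne({ name: "Annice", "courses.name": "Math" });
    doc.courses.forEach(function (course, i) {
      if (course.name !== "Math") return;
      course.scores.forEach(function (s, j) {
        if (s.term !== 1) return;
        var update = { $set: {} };
        update.$set["courses." + i + ".scores." + j + ".score"] = 85;   // e.g. courses.0.scores.0.score
        db.students.update({ _id: doc._id }, update);
      });
    });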

Of course, if your model never needs to modify elements inside nested arrays, this limitation does not apply.

Configuration

Set an appropriate MongoDB connection pool size (Connections Per Host)

The Java driver's default connection pool size is 100. It is recommended to adjust it according to the application's actual load; for applications under light load, it can be reduced to lower the footprint on the application server.

Use an appropriate write concern (Write Concern)

The minimum recommended MongoDB deployment is a replica set with three data nodes. By default, a write operation (update, insert, or delete) returns to the application as soon as it completes on the primary node; the write is then replicated asynchronously to the other nodes via the oplog in the background. In extreme cases, these writes may not yet have been replicated to the secondaries when a primary/secondary switchover happens; the write is then rolled back into a file on the original primary and is no longer visible to the application. To prevent this, MongoDB recommends using the {w: "majority"} option for important data. {w: "majority"} ensures that a write returns success only after the data has been replicated to a majority of the nodes, which effectively prevents rollback of that data.

You can also use {j: 1} (it can be combined with w: "majority") to require that the write has been committed to the write-ahead log (journal) before success is returned to the application. This lowers write performance, but is worth considering for important data.
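
A minimal sketch in the mongo shell (collection and document are hypothetical):

    db.orders.insert(
      { order_id: 1001, amount: 250 },
      { writeConcern: { w: "majority", j: true } }   // wait for a majority of nodes and a journal flush
    )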

Use appropriate read preference settings (Read Preference)

Because MongoDB is a distributed system, each piece of data is replicated on multiple nodes. Which node to read from should be decided according to the application's needs. The read preference options that can be configured are:

  • primary: the default; read from the primary node
  • primaryPreferred: read from the primary first; if that is not possible, read from any secondary
  • secondary: read from a secondary (when there are several secondaries, one is chosen at random)
  • secondaryPreferred: read from a secondary first; if for some reason no secondary can serve the request, read from the primary
  • nearest: read from the nearest node, where distance is determined by ping time

With any option other than the first, the data you read may not be the latest, because replication happens asynchronously in the background.
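
A minimal sketch in the mongo shell (collection and filter are hypothetical): send a reporting query to a secondary when slightly stale data is acceptable.

    db.orders.find({ status: "shipped" }).readPref("secondaryPreferred")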

Do not instantiate multiple MongoClients

MongoClient is a thread-safe class with its own built-in connection pool. Usually you should not instantiate more than one MongoClient within a JVM, to avoid wasting resources and opening too many connections.

Use a retry mechanism for writes

With replica sets, MongoDB can achieve 99.999% availability. When the primary node cannot accept writes, the system automatically fails over to another node. The failover may take a few seconds, during which the application should catch the resulting exception and retry the operation. The retry should back off, for example waiting 1s, 2s, 4s, 8s, and so on between attempts.
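
A hedged sketch in plain mongo-shell JavaScript (the helper name and retry counts are made up for illustration):

    function writeWithRetry(writeFn, maxRetries) {
      var delayMs = 1000;                         // back off: 1s, 2s, 4s, 8s, ...
      for (var attempt = 0; attempt <= maxRetries; attempt++) {
        try {
          return writeFn();                       // attempt the write
        } catch (e) {
          if (attempt === maxRetries) throw e;    // give up after the final attempt
          sleep(delayMs);                         // mongo shell sleep(), in milliseconds
          delayMs *= 2;
        }
      }
    }

    writeWithRetry(function () {
      return db.orders.insert({ order_id: 1002, amount: 99 });
    }, 4);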

Avoid using long field names

MongoDB has no table structure definition; the structure of each document is determined by its own fields, and every field name is repeated inside every document. Overly long field names therefore increase memory and network bandwidth requirements. (Thanks to compression, long field names do not take up much extra space on disk.)

Use consistent naming conventions

Such as: School, Course, StudentRecord
or: school, course, student_record

Use update statements properly

Do not treat MongoDB as equivalent to an ordinary key-value (KV) database. MongoDB supports in-place updates similar to the update statements of relational databases: you only need to specify the fields to be updated in the update statement, rather than writing back the whole document object.

For example, suppose I want to change a user's name from TJ to Tang Jianfa.

Not recommended practice:

    user = db.users.findOne({_id: 101});
    user.name="Tang Jianfa"
    db.users.save(user);

Recommended practices:

    user = db.users.findOne({_id: 101});        
    // do certain things
    db.users.update({_id:101}, {$set: {name: "Tang Jianfa"}});

Use projections (Projection) to reduce the returned content

MongoDB supports SQL-like projections (similar to a select list) that let you filter which fields are returned. Using projections reduces the amount of content returned, the amount of data transmitted over the network, and the time needed to convert the result into objects.
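
A minimal sketch (collection and fields are hypothetical): return only the fields the caller needs and suppress _id.

    db.users.find({ active: true }, { name: 1, email: 1, _id: 0 })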

Use TTL indexes to automatically delete expired data

We often use MongoDB to store time-sensitive data, such as 7 days of monitoring data. Instead of writing your own background script to clean up stale data periodically, you can use a TTL index to have MongoDB delete expired data automatically:

db.data.ensureIndex({create_time:1}, {expireAfterSeconds: 7*24*3600})

Use the execute method to implement upserts

Sometimes you do not know whether a document already exists in the database. In that case you either query first, or use an upsert statement. With SpringData, an upsert statement requires you to spell out the value of every field in the upsert statement, which is cumbersome when there are many fields. SpringData MongoDB's MongoTemplate offers an execute method that lets you implement the upsert in a single DB call, without tediously listing all the fields. For example:

    public boolean persistEmployee(Employee employee) throws Exception {

        BasicDBObject dbObject = new BasicDBObject();
        mongoTemplate.getConverter().write(employee, dbObject);
        mongoTemplate.execute(Employee.class, new CollectionCallback<Object>() {
            public Object doInCollection(DBCollection collection) throws MongoException, DataAccessException {
                collection.update(new Query(Criteria.where("name").is(employee.getName())).getQueryObject(),
                        dbObject,
                        true,  // means upsert - true
                        false  // multi update – false
                );
                return null;
            }
        });
        return true;
    }

Removing SpringData MongoDB's _class field

By default, SpringData MongoDB adds a _class field to every MongoDB document, storing the fully qualified class name, such as "com.mongodb.examples.Customer". For small documents, this field can take up a non-trivial share of the storage space. If you do not want SpringData to add this field automatically, you can:

1) Customize the MongoTypeMapper:

@Bean
public MongoTemplate mongoTemplate() throws UnknownHostException {
    MappingMongoConverter mappingMongoConverter =  new MappingMongoConverter(new DefaultDbRefResolver
            (mongoDbFactory()), new
            MongoMappingContext());
    mappingMongoConverter.setTypeMapper(new DefaultMongoTypeMapper(null));
    return new MongoTemplate(mongoDbFactory(), mappingMongoConverter );
}

2) Explicitly specify the class/type when using find statements:

    mongoTemplate.find(new Query(), Inventory.class)

Reprinted from:
http://www.mongoing.com/archives/3895
