M312: Diagnostics and Debugging, Chapter 4 (Connectivity): Study Notes

Environment

Operating system: Windows 10 Home (Chinese edition)
MongoDB: MongoDB 3.4

MongoDB install path: E:\MongoDB\Server\3.4\bin\
MongoDB data path: E:\MongoDB\data

Lab question

Lab: Diagnosing unknown issues


In this lab, you are asked to diagnose an unknown issue. Developers relying on a MongoDB deployment you administer are complaining that the database is taking longer and longer to respond. Additionally, the server hosting the current primary crashes after an unspecified amount of time. This happens on every server.

You can download the required dataset, companies.json, here.

Let’s go ahead and set up the environment.

First, launch a replica set named “m312rs” using mtools:

cd m312-vagrant-env
vagrant up
vagrant ssh
mlaunch --replicaset --name "m312rs" --port 30000 --wiredTigerCacheSizeGB 0.5 --oplogSize 100

The following step is to download the necessary handouts and make them available in your m312 virtual machine:

cp Downloads/lab1_chapter4.py m312-vagrant-env/shared
cp Downloads/companies.json m312-vagrant-env/dataset

The command-line example is written for Unix, but simply copying these files by any other means will suffice.

Next, we will load the dataset into our replica set. Let’s load the data into a collection named companies:

vagrant ssh
mongoimport --host m312rs/m312 --port 30000 -d m312 -c companies /dataset/companies.json

Now execute the homework script:

/shared/lab1_chapter4.py

Open a new terminal window and make sure to ssh in to your vagrant box.
Let the script run for a minute and begin checking the replica set. What is the primary issue?

Check all that apply:

{w: “majority”} is being specified and it’s taking too long for the mongods to acknowledge writes.
The connection pool continually grows, ultimately starving the server of resources.
The replica set has no primary.
The amount of operations has caused the mongods to reach the CPU limit for the server.

Solution

Change to the Vagrant environment directory:

C:\Users\Shinelon>e:

E:\>cd MongoDB\m312\chapter_3_slow_queries\m312\m312-vagrant-env

Start the virtual machine and connect to it over SSH:

E:\MongoDB\m312\chapter_3_slow_queries\m312\m312-vagrant-env>vagrant up
E:\MongoDB\m312\chapter_3_slow_queries\m312\m312-vagrant-env>vagrant ssh

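Run mlaunch to create a three-member replica set. The first attempt fails because an older mlaunch environment already exists at /home/vagrant/data; after confirming that no mongod is running, move that directory aside and run mlaunch again: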

vagrant@m312:~$ mlaunch --replicaset --name "m312rs" --port 30000 --wiredTigerCacheSizeGB 0.5 --oplogSize 100
A different environment already exists at /home/vagrant/data.
vagrant@m312:~$ ls /home/vagrant/data/
authdb  db  mongod.log
vagrant@m312:~$ ps -ef|grep mongo
vagrant   1956  1925  0 06:28 pts/0    00:00:00 grep --color=auto mongo
vagrant@m312:~$ sudo mv /home/vagrant/data /home/vagrant/data_bak
vagrant@m312:~$ mlaunch --replicaset --name "m312rs" --port 30000 --wiredTigerCacheSizeGB 0.5 --oplogSize 100
launching: mongod on port 30000
launching: mongod on port 30001
launching: mongod on port 30002
replica set 'm312rs' initialized.

As instructed, copy lab1_chapter4.py to E:\MongoDB\m312\chapter_3_slow_queries\m312\m312-vagrant-env\shared
and companies.json to E:\MongoDB\m312\chapter_3_slow_queries\m312\m312-vagrant-env\dataset.

Import companies.json with mongoimport:

vagrant@m312:~$ mongoimport --host m312rs/m312 --port 30000 -d m312 -c companies /dataset/companies.json
2018-04-26T06:33:40.514+0000    connected to: m312rs/m312:30000
2018-04-26T06:33:43.508+0000    [#########...............] m312.companies       30.0MB/74.6MB (40.3%)
2018-04-26T06:33:46.510+0000    [##################......] m312.companies       57.8MB/74.6MB (77.5%)
2018-04-26T06:33:48.458+0000    [########################] m312.companies       74.6MB/74.6MB (100.0%)
2018-04-26T06:33:48.458+0000    imported 18801 documents

Start the lab1_chapter4.py script:

vagrant@m312:~$ /shared/lab1_chapter4.py
application running, open a new terminal window!
spawning a new process
spawning a new process

As instructed, open a new terminal window and SSH into the Vagrant box.

Try connecting to the replica set primary:

vagrant@m312:~$ mongo -port 30000
MongoDB shell version v3.4.2
connecting to: mongodb://127.0.0.1:30000/
MongoDB server version: 3.4.2
Server has startup warnings:
2018-04-26T06:28:50.423+0000 I STORAGE  [initandlisten]
2018-04-26T06:28:50.423+0000 I STORAGE  [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2018-04-26T06:28:50.423+0000 I STORAGE  [initandlisten] **          See http://dochub.mongodb.org/core/prodnotes-filesystem
2018-04-26T06:28:50.545+0000 I CONTROL  [initandlisten]
2018-04-26T06:28:50.545+0000 I CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2018-04-26T06:28:50.545+0000 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2018-04-26T06:28:50.545+0000 I CONTROL  [initandlisten]
2018-04-26T06:28:50.545+0000 I CONTROL  [initandlisten]
2018-04-26T06:28:50.545+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2018-04-26T06:28:50.545+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2018-04-26T06:28:50.545+0000 I CONTROL  [initandlisten]
2018-04-26T06:28:50.545+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2018-04-26T06:28:50.545+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2018-04-26T06:28:50.545+0000 I CONTROL  [initandlisten]
MongoDB Enterprise m312rs:PRIMARY>

The shell prompt shows m312rs:PRIMARY, so the replica set does have a primary. The option

The replica set has no primary.

is ruled out.
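
The same check can be scripted instead of read off the shell prompt. The following is a minimal sketch, assuming a document shaped like the output of the replSetGetStatus command; the function name find_primary and the sample values are hypothetical and only for illustration.

```python
def find_primary(status):
    """Return the name of the PRIMARY member in a replSetGetStatus-style document, or None."""
    for member in status.get("members", []):
        if member.get("stateStr") == "PRIMARY":
            return member.get("name")
    return None


# Hypothetical sample shaped like replSetGetStatus output (values are made up).
sample_status = {
    "set": "m312rs",
    "members": [
        {"name": "m312:30000", "stateStr": "PRIMARY"},
        {"name": "m312:30001", "stateStr": "SECONDARY"},
        {"name": "m312:30002", "stateStr": "SECONDARY"},
    ],
}

print(find_primary(sample_status))
```

With PyMongo installed, the real document could be obtained with MongoClient(port=30000).admin.command("replSetGetStatus") and passed to the same function.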

Use a write concern to test how quickly writes replicate between members:

MongoDB Enterprise m312rs:PRIMARY> db.foo.insertOne({hello:'world'},{writeConcern:{w: 3,wtimeout:1}})
2018-04-26T06:49:18.026+0000 E QUERY    [thread1] WriteConcernError: waiting for replication timed out :
WriteConcernError({
        "code" : 64,
        "codeName" : "WriteConcernFailed",
        "errInfo" : {
                "wtimeout" : true
        },
        "errmsg" : "waiting for replication timed out"
})
WriteConcernError@src/mongo/shell/bulk_api.js:504:48
Bulk/mergeBatchResults@src/mongo/shell/bulk_api.js:841:52
Bulk/executeBatch@src/mongo/shell/bulk_api.js:906:13
Bulk/this.execute@src/mongo/shell/bulk_api.js:1150:21
DBCollection.prototype.insertOne@src/mongo/shell/crud_api.js:242:9
@(shell):1:1
MongoDB Enterprise m312rs:PRIMARY> db.foo.insertOne({hello:'world'},{writeConcern:{w: 3,wtimeout:1000}})
{
        "acknowledged" : true,
        "insertedId" : ObjectId("5ae1773b48abd6a2ab43014a")
}
MongoDB Enterprise m312rs:PRIMARY> db.foo.find()
{ "_id" : ObjectId("5ae1766d48abd6a2ab430148"), "hello" : "world" }
{ "_id" : ObjectId("5ae1768848abd6a2ab430149"), "hello" : "world" }

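Replication between members is healthy. The failure with wtimeout: 1 is unsurprising, since 1 ms leaves almost no time for replication, while with a 1-second wtimeout the w: 3 write is acknowledged by all three members. The option

{w: “majority”} is being specified and it’s taking too long for the mongods to acknowledge writes.

is ruled out.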
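
To put a number on the acknowledgement time rather than only seeing pass or fail, the write can be timed from Python. This is a rough sketch under several assumptions: Python 3, PyMongo installed, and the replica set from this lab listening on port 30000. The helper names timed_insert and check_replication_latency are hypothetical.

```python
import time


def timed_insert(collection, document):
    """Insert one document and return how long the acknowledgement took, in milliseconds."""
    start = time.perf_counter()
    collection.insert_one(document)
    return (time.perf_counter() - start) * 1000.0


def check_replication_latency(port=30000):
    """Time a w:3 write against the lab replica set (needs PyMongo and a running set)."""
    # Imported lazily so the helper above can be used without PyMongo installed.
    from pymongo import MongoClient, WriteConcern

    client = MongoClient(port=port)
    try:
        # Same write concern as the shell test: all three members, 1 second timeout.
        coll = client.test.get_collection(
            "foo", write_concern=WriteConcern(w=3, wtimeout=1000)
        )
        return timed_insert(coll, {"hello": "world"})
    finally:
        client.close()
```

Calling check_replication_latency() a few times while the lab script is running gives a feel for whether acknowledgement times are anywhere near the timeout.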

Check CPU usage:

vagrant@m312:~$ top

top - 07:08:29 up 41 min,  2 users,  load average: 0.44, 0.84, 0.73
Tasks: 598 total,   2 running, 596 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.6 us,  4.4 sy,  0.0 ni, 84.5 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem:   4048092 total,  3550844 used,   497248 free,    26120 buffers
KiB Swap:        0 total,        0 used,        0 free.   558316 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2057 vagrant   20   0 1547816 242748   8528 S   3.0  6.0   0:42.65 mongod
 2005 vagrant   20   0 1702516 270388   9188 S   2.7  6.7   0:44.13 mongod
 2031 vagrant   20   0 1563172 246972   8792 S   2.0  6.1   0:42.56 mongod
 2956 vagrant   20   0  424600  12292   2108 S   0.7  0.3   0:00.81 python
 2323 vagrant   20   0   54104  14064   3688 S   0.3  0.3   0:02.00 python
 2328 vagrant   20   0  424600  12188   2104 S   0.3  0.3   0:00.90 python

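The VM clearly has plenty of spare CPU (84.5% idle, with each mongod using about 3% or less), so the option

The amount of operations has caused the mongods to reach the CPU limit for the server.

is incorrect.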

Inspecting lab1_chapter4.py shows that it simply keeps spawning processes, each of which opens a database connection and never closes it:

def create_a_connection(host, port, replset):
    client = MongoClient(host=host, port=port, replicaSet=replset)
    try:
        query = re.compile(random.choice(ascii_lowercase), re.IGNORECASE)
        client.m312.companies.find({"name": query})
        sleep(5)

    except Exception, e:
        print ("Exception caught, but I can handle this: {e}".format(e=e))

    while True:
        # noop
        sleep(30)
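
For contrast, here is a minimal sketch of how the same query could be run without leaking the connection. It is not part of the lab handout: the name query_once and the client_factory parameter are hypothetical, and in practice the factory would be pymongo.MongoClient, the class used in the excerpt above.

```python
import random
import re
from string import ascii_lowercase


def query_once(client_factory, host, port, replset):
    """Open one client, run one query, and always close the client afterwards."""
    client = client_factory(host=host, port=port, replicaSet=replset)
    try:
        query = re.compile(random.choice(ascii_lowercase), re.IGNORECASE)
        # Iterate the cursor so the results are actually fetched, then count them.
        return sum(1 for _ in client.m312.companies.find({"name": query}))
    finally:
        # Closing the client releases its sockets instead of holding them open.
        client.close()
```

A call would look like query_once(MongoClient, "m312", 30000, "m312rs"). The key difference from the lab script is the finally block: the excerpt parks each process in an endless sleep loop with its client still open.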

Check the connection count:

MongoDB Enterprise m312rs:PRIMARY> db.serverStatus().connections
{ "current" : 499, "available" : 320, "totalCreated" : 1010 }

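The connection count has blown up: 499 connections are currently open, only 320 slots remain available, and 1010 connections have been created in total.

So the answer is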
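
The numbers in that document can be turned into a usage figure. The sketch below assumes that current plus available approximates the server's connection limit; the function name connection_usage is hypothetical, and the input is the document observed above.

```python
def connection_usage(connections):
    """Summarise a serverStatus().connections document."""
    current = connections["current"]
    available = connections["available"]
    # Assumption: open connections plus remaining slots approximates the limit.
    limit = current + available
    return {"limit": limit, "used_fraction": current / float(limit)}


# Values copied from the serverStatus output in this lab.
observed = {"current": 499, "available": 320, "totalCreated": 1010}
usage = connection_usage(observed)
print("limit=%d used=%.1f%%" % (usage["limit"], usage["used_fraction"] * 100))
```

Sampling this every few seconds while the lab script runs would show the used fraction climbing steadily, which is the symptom the correct answer describes.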

The connection pool continually grows, ultimately starving the server of resources.