Teuthology Usage and Writing Ceph Automated Test Cases (2)

Commonly used Teuthology tasks

The following lists some of the commonly used tasks. Many more are not covered here; you can browse the tasks directory yourself. This is also the last article in this series. Other topics, such as the code execution flow, are simple enough that there is little worth writing about; they can be covered later if needed.

Setting Up a Teuthology Ceph Automated Test Platform on CentOS (1)
Setting Up a Teuthology Ceph Automated Test Platform on CentOS (2)
Setting Up a Teuthology Ceph Automated Test Platform on CentOS (3)
Setting Up a Teuthology Ceph Automated Test Platform on CentOS (4)
Setting Up a Teuthology Ceph Automated Test Platform on CentOS (5)
Deploying Teuthology Nodes: Ceph Automated Test Platform (6)
Teuthology Usage and Writing Ceph Automated Test Cases (1)
Teuthology Usage and Writing Ceph Automated Test Cases (2)

Install

Install packages for a given project.

    tasks:
    - install:
        project: ceph
        branch: bar
    - install:
        project: samba
        branch: foo
        extra_packages: ['samba']
    - install:
        rhbuild: 1.3.0
        playbook: downstream_setup.yml
        vars:
           yum_repos:
             - url: "http://location.repo"
               name: "ceph_repo"

Overrides are project specific:

    overrides:
      install:
        ceph:
          sha1: ...

Debug packages may optionally be installed:

    overrides:
      install:
        ceph:
          debuginfo: true

Default package lists (which come from packages.yaml) may be overridden:

    overrides:
      install:
        ceph:
          packages:
            deb:
            - ceph-osd
            - ceph-mon
            rpm:
            - ceph-devel
            - rbd-fuse

    When tag, branch and sha1 do not reference the same commit hash, the
    tag takes precedence over the branch and the branch takes precedence
    over the sha1.

    When the overrides have a sha1 that is different from the sha1 of
    the project to be installed, it will be a noop if the project has
    a branch or tag, because they take precedence over the sha1. For
    instance:

    overrides:
      install:
        ceph:
          sha1: 1234
    tasks:
    - install:
        project: ceph
        sha1: 4567
        branch: foobar # which has sha1 4567

    The override will transform the tasks as follows:
    tasks:
    - install:
        project: ceph
        sha1: 1234
        branch: foobar # which has sha1 4567

    But the branch takes precedence over the sha1, so foobar
    will be installed; the override of the sha1 has no effect.

    When passed 'rhbuild' as a key, it will attempt to install an RH Ceph build
    using ceph-deploy.

    Reminder regarding teuthology-suite side effects:
    The teuthology-suite command always adds the following:
    overrides:
      install:
        ceph:
          sha1: 1234

    where sha1 matches the --ceph argument. For instance if
    teuthology-suite is called with --ceph master, the sha1 will be
    the tip of master. If called with --ceph v0.94.1, the sha1 will be
    the v0.94.1 (as returned by git rev-parse v0.94.1 which is not to
    be confused with git rev-parse v0.94.1^{commit})

Upgrade packages for a given project

Upgrades packages for a given project. In the examples below, {cmd_parameter} stands for the concrete upgrade sub-task name, for example ceph_deploy_upgrade.

    For example::
        tasks:
        - install.{cmd_parameter}:
             all:
                branch: end

    or specify specific roles::
        tasks:
        - install.{cmd_parameter}:
             mon.a:
                branch: end
             osd.0:
                branch: other

    or rely on the overrides for the target version::
        overrides:
          install:
            ceph:
              sha1: ...
        tasks:
        - install.{cmd_parameter}:
            all:
    (HACK: the overrides will *only* apply the sha1/branch/tag if those
    keys are not present in the config.)

    It is also possible to attempt to exclude packages from the upgrade set:
        tasks:
        - install.{cmd_parameter}:
            exclude_packages: ['ceph-test', 'ceph-test-dbg']
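
    As a concrete sketch with {cmd_parameter} replaced by install.upgrade (install.ceph_deploy_upgrade follows the same pattern), upgrading selected roles to a named branch might look like this (the branch name is illustrative):

        tasks:
        - install.upgrade:
            mon.a:
              branch: jewel
            osd.0:
              branch: jewel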

ceph

Set up and tear down a Ceph cluster.

    For example::
        tasks:
        - ceph:
        - interactive:

    You can also specify what branch to run::
        tasks:
        - ceph:
            branch: foo

    Or a tag::
        tasks:
        - ceph:
            tag: v0.42.13

    Or a sha1::
        tasks:
        - ceph:
            sha1: 1376a5ab0c89780eab39ffbbe436f6a6092314ed

    Or a local source dir::
        tasks:
        - ceph:
            path: /home/sage/ceph

    To capture code coverage data, use::
        tasks:
        - ceph:
            coverage: true

    To use btrfs, ext4, or xfs on the target's scratch disks, use::
        tasks:
        - ceph:
            fs: xfs
            mkfs_options: [-b,size=65536,-l,logdev=/dev/sdc1]
            mount_options: [nobarrier, inode64]

    Note, this will cause the task to check the /scratch_devs file on each node
    for available devices.  If no such file is found, /dev/sdb will be used.

    To run some daemons under valgrind, include their names
    and the tool/args to use in a valgrind section::

        tasks:
        - ceph:
            valgrind:
              mds.1: --tool=memcheck
              osd.1: [--tool=memcheck, --leak-check=no]

    Those nodes which are using memcheck or valgrind will get
    checked for bad results.

    To adjust or modify config options, use::
        tasks:
        - ceph:
            conf:
              section:
                key: value

    For example::
        tasks:
        - ceph:
            conf:
              mds.0:
                some option: value
                other key: other value
              client.0:
                debug client: 10
                debug ms: 1

    By default, the cluster log is checked for errors and warnings,
    and the run marked failed if any appear. You can ignore log
    entries by giving a list of egrep compatible regexes, i.e.:
        tasks:
        - ceph:
            log-whitelist: ['foo.*bar', 'bad message']

    To run multiple ceph clusters, use multiple ceph tasks, and roles
    with a cluster name prefix, e.g. cluster1.client.0. Roles with no
    cluster use the default cluster name, 'ceph'. OSDs from separate
    clusters must be on separate hosts. Clients and non-osd daemons
    from multiple clusters may be colocated. For each cluster, add an
    instance of the ceph task with the cluster name specified, e.g.::
        roles:
        - [mon.a, osd.0, osd.1]
        - [backup.mon.a, backup.osd.0, backup.osd.1]
        - [client.0, backup.client.0]
        tasks:
        - ceph:
            cluster: ceph
        - ceph:
            cluster: backup

Wait for a failure of a ceph daemon

For example::
      tasks:
      - ceph.wait_for_failure: [mds.*]

      tasks:
      - ceph.wait_for_failure: [osd.0, osd.2]

      tasks:
      - ceph.wait_for_failure:
          daemons: [osd.0, osd.2]

Stop ceph daemons

For example::
      tasks:
      - ceph.stop: [mds.*]

      tasks:
      - ceph.stop: [osd.0, osd.2]

      tasks:
      - ceph.stop:
          daemons: [osd.0, osd.2]

Restart ceph daemons

For example::
      tasks:
      - ceph.restart: [all]

or::
      tasks:
      - ceph.restart: [osd.0, mon.1, mds.*]

or::
      tasks:
      - ceph.restart:
          daemons: [osd.0, mon.1]
          wait-for-healthy: false
          wait-for-osds-up: true

interactive

Run an interactive Python shell, with the cluster accessible via the ``ctx``
variable. Hit ``control-D`` to continue.

    This is also useful to pause the execution of the test between two
    tasks, either to perform ad hoc operations, or to examine the
    state of the cluster. You can also use it to easily bring up a
    Ceph cluster for ad hoc testing.

    For example::
        tasks:
        - ceph:
        - interactive:

rados

Run RadosModel-based integration tests. This performs the basic internal tests for reads, writes, snapshots, rollback, erasure coding, clones, and so on; under the hood it invokes osd/TestRados.cc from ceph/src/test.

The config should be as follows::
        rados:
          clients: [client list]
          ops: <number of ops>
          objects: <number of objects to use>
          max_in_flight: <max number of operations in flight>
          object_size: <size of objects in bytes>
          min_stride_size: <minimum write stride size in bytes>
          max_stride_size: <maximum write stride size in bytes>
          op_weights: <dictionary mapping operation type to integer weight>
          runs: <number of times to run> - the pool is remade between runs
          ec_pool: use an ec pool
          erasure_code_profile: profile to use with the erasure coded pool
          fast_read: enable ec_pool's fast_read
          min_size: set the min_size of created pool
          pool_snaps: use pool snapshots instead of selfmanaged snapshots
          write_fadvise_dontneed: write with the LIBRADOS_OP_FLAG_FADVISE_DONTNEED flag set. This hints that the data will not be accessed again in the near future, so the OSD backend need not keep it in cache.

    For example::
        tasks:
        - ceph:
        - rados:
            clients: [client.0]
            ops: 1000
            max_seconds: 0   # 0 for no limit
            objects: 25
            max_in_flight: 16
            object_size: 4000000
            min_stride_size: 1024
            max_stride_size: 4096
            op_weights:
              read: 20
              write: 10
              delete: 2
              snap_create: 3
              rollback: 2
              snap_remove: 0
            ec_pool: create an ec pool, defaults to False
            erasure_code_use_overwrites: test overwrites, default false
            erasure_code_profile:
              name: teuthologyprofile
              k: 2
              m: 1
              crush-failure-domain: osd
            pool_snaps: true
            write_fadvise_dontneed: true
            runs: 10
        - interactive:

    Optionally, you can provide the pool name to run against:
        tasks:
        - ceph:
        - exec:
            client.0:
              - ceph osd pool create foo
        - rados:
            clients: [client.0]
            pools: [foo]
            ...
    Alternatively, you can provide a pool prefix:
        tasks:
        - ceph:
        - exec:
            client.0:
              - ceph osd pool create foo.client.0
        - rados:
            clients: [client.0]
            pool_prefix: foo
            ...

    The tests are run asynchronously, they are not complete when the task
    returns. For instance:
        - rados:
            clients: [client.0]
            pools: [ecbase]
            ops: 4000
            objects: 500
            op_weights:
              read: 100
              write: 100
              delete: 50
              copy_from: 50
        - print: "**** done rados ec-cache-agent (part 2)"
    will run the print task immediately after the rados task begins, but
    not after it completes. To make the rados task a blocking / sequential
    task, use:
        - sequential:
          - rados:
              clients: [client.0]
              pools: [ecbase]
              ops: 4000
              objects: 500
              op_weights:
                read: 100
                write: 100
                delete: 50
                copy_from: 50
        - print: "**** done rados ec-cache-agent (part 2)"

overrides

Under overrides you can add Ceph configuration with higher priority. It lets you add configuration without modifying the existing yaml files, which improves reuse.
overrides: override behavior. Typically, this includes sub-tasks being overridden. Overrides technically is not a task (there is no ‘def task’ in an overrides.py file), but from a user’s standpoint can be described as behaving like one. Sub-tasks can nest further information. For example, overrides of install tasks are project specific, so the following section of a yaml file would cause all ceph installations to default to using the jewel branch:

overrides:
  install:
    ceph:
      branch: jewel
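
The same mechanism applies to other tasks' overrides, for example the ceph task. A minimal sketch (the conf values and the log-whitelist entry are only illustrative) that raises OSD debug logging for every job that reuses this fragment:

overrides:
  ceph:
    conf:
      osd:
        debug osd: 20
        debug ms: 1
    log-whitelist:
      - wrongly marked me down
tasks:
- install:
- ceph: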

workunit

Run workunit test scripts (from qa/workunits in the Ceph source tree) on the specified clients.

    For example::
        tasks:
        - ceph:
        - ceph-fuse: [client.0]
        - workunit:
            clients:
              client.0: [direct_io, xattrs.sh]
              client.1: [snaps]
            branch: foo

    You can also run a list of workunits on all clients:
        tasks:
        - ceph:
        - ceph-fuse:
        - workunit:
            tag: v0.47
            clients:
              all: [direct_io, xattrs.sh, snaps]

    If you have an "all" section it will run all the workunits
    on each client simultaneously, AFTER running any workunits specified
    for individual clients. (This prevents unintended simultaneous runs.)
    To customize tests, you can specify environment variables as a dict. You
    can also specify a time limit for each work unit (defaults to 3h):
        tasks:
        - ceph:
        - ceph-fuse:
        - workunit:
            sha1: 9b28948635b17165d17c1cf83d4a870bd138ddf6
            clients:
              all: [snaps]
            env:
              FOO: bar
              BAZ: quux
            timeout: 3h

    This task supports roles that include a ceph cluster, e.g.::
        tasks:
        - ceph:
        - workunit:
            clients:
              backup.client.0: [foo]
              client.1: [bar] # cluster is implicitly 'ceph'

    You can also specify an alternative top-level dir to 'qa/workunits', like
    'qa/standalone', with::
        tasks:
        - install:
        - workunit:
            basedir: qa/standalone
            clients:
              client.0:
                - test-ceph-helpers.sh

mon_thrash

Thrash the monitors: repeatedly kill a monitor, wait for quorum to re-form, then revive it.

How it works::
    - pick a monitor
    - kill it
    - wait for quorum to be formed
    - sleep for 'revive_delay' seconds
    - revive monitor
    - wait for quorum to be formed
    - sleep for 'thrash_delay' seconds

    Options::
    seed                Seed to use on the RNG to reproduce a previous
                        behaviour (default: None; i.e., not set)

    revive_delay        Number of seconds to wait before reviving
                        the monitor (default: 10)

    thrash_delay        Number of seconds to wait in-between
                        test iterations (default: 0)

    thrash_store        Thrash monitor store before killing the monitor
                        being thrashed (default: False)

    thrash_store_probability  Probability of thrashing a monitor's store
                              (default: 50)

    thrash_many         Thrash multiple monitors instead of just one. If
                        'maintain-quorum' is set to False, then we will
                        thrash up to as many monitors as there are
                        available. (default: False)

    maintain_quorum     Always maintain quorum, taking care on how many
                        monitors we kill during the thrashing. If we
                        happen to only have one or two monitors configured,
                        if this option is set to True, then we won't run
                        this task as we cannot guarantee maintenance of
                        quorum. Setting it to false however would allow the
                        task to run with as many as just one single monitor.
                        (default: True)

    freeze_mon_probability: how often to freeze the mon instead of killing it,
                        in % (default: 0)

    freeze_mon_duration: how many seconds to freeze the mon (default: 15)

    scrub               Scrub after each iteration (default: True)

    Note: if 'store-thrash' is set to True, then 'maintain-quorum' must also
          be set to True.

    For example::
    tasks:
    - ceph:
    - mon_thrash:
        revive_delay: 20
        thrash_delay: 1
        thrash_store: true
        thrash_store_probability: 40
        seed: 31337
        maintain_quorum: true
        thrash_many: true
    - ceph-fuse:
    - workunit:
        clients:
          all:
            - mon/workloadgen.sh

thrashosds

“Thrash” the OSDs by randomly marking them out/down (and then back in) until the task is ended. This loops, and every op_delay seconds it randomly chooses to add or remove an OSD (even odds) unless there are fewer than min_out OSDs out of the cluster, or more than min_in OSDs in the cluster.

    All commands are run on mon0 and it stops when __exit__ is called.

    The config is optional, and is a dict containing some or all of:
    cluster: (default 'ceph') the name of the cluster to thrash
    min_in: (default 4) the minimum number of OSDs to keep in the cluster
    min_out: (default 0) the minimum number of OSDs to keep out of the cluster
    op_delay: (5) the length of time to sleep between changing an OSD's status
    min_dead: (0) minimum number of osds to leave down/dead.
    max_dead: (0) maximum number of osds to leave down/dead before waiting
               for clean.  This should probably be num_replicas - 1.
    clean_interval: (60) the approximate length of time to loop before
       waiting until the cluster goes clean. (In reality this is used
       to probabilistically choose when to wait, and the method used
       makes it closer to -- but not identical to -- the half-life.)

    scrub_interval: (-1) the approximate length of time to loop before
       waiting until a scrub is performed while cleaning. (In reality
       this is used to probabilistically choose when to wait, and it
       only applies to the cases where cleaning is being performed).
       -1 is used to indicate that no scrubbing will be done.

    chance_down: (0.4) the probability that the thrasher will mark an
       OSD down rather than marking it out. (The thrasher will not
       consider that OSD out of the cluster, since presently an OSD
       wrongly marked down will mark itself back up again.) This value
       can be either an integer (eg, 75) or a float probability (eg
       0.75).

    chance_test_min_size: (0) chance to run test_pool_min_size,
       which:
       - kills all but one osd
       - waits
       - kills that osd
       - revives all other osds
       - verifies that the osds fully recover

    timeout: (360) the number of seconds to wait for the cluster
       to become clean after each cluster change. If this doesn't
       happen within the timeout, an exception will be raised.

    revive_timeout: (150) number of seconds to wait for an osd asok to
       appear after attempting to revive the osd

    thrash_primary_affinity: (true) randomly adjust primary-affinity
    chance_pgnum_grow: (0) chance to increase a pool's size
    chance_pgpnum_fix: (0) chance to adjust pgpnum to pg for a pool
    pool_grow_by: (10) amount to increase pgnum by
    max_pgs_per_pool_osd: (1200) don't expand pools past this size per osd
    pause_short: (3) duration of short pause
    pause_long: (80) duration of long pause
    pause_check_after: (50) assert osd down after this long
    chance_inject_pause_short: (1) chance of injecting short stall
    chance_inject_pause_long: (0) chance of injecting long stall
    clean_wait: (0) duration to wait before resuming thrashing once clean
    sighup_delay: (0.1) duration to delay between sending signal.SIGHUP to a
                  random live osd
    powercycle: (false) whether to power cycle the node instead
        of just the osd process. Note that this assumes that a single
        osd is the only important process on the node.

    bdev_inject_crash: (0) seconds to delay while inducing a synthetic crash.
        the delay lets the BlockDevice "accept" more aio operations but blocks
        any flush, and then eventually crashes (losing some or all ios).  If 0,
        no bdev failure injection is enabled.

    bdev_inject_crash_probability: (.5) probability of doing a bdev failure
        injection crash vs a normal OSD kill.

    chance_test_backfill_full: (0) chance to simulate full disks stopping
        Backfill

    chance_test_map_discontinuity: (0) chance to test map discontinuity
    map_discontinuity_sleep_time: (40) time to wait for map trims
    ceph_objectstore_tool: (true) whether to export/import a pg while an osd is down

    chance_move_pg: (1.0) chance of moving a pg if more than 1 osd is down (default 100%)

    optrack_toggle_delay: (2.0) duration to delay between toggling op tracker
                  enablement to all osds

    dump_ops_enable: (true) continuously dump ops on all live osds
    noscrub_toggle_delay: (2.0) duration to delay between toggling noscrub

    disable_objectstore_tool_tests: (false) disable ceph_objectstore_tool
        based tests

    chance_thrash_cluster_full: .05
    chance_thrash_pg_upmap: 1.0
    chance_thrash_pg_upmap_items: 1.0

    For example::
    tasks:
    - ceph:
    - thrashosds:
        cluster: ceph
        chance_down: 10
        op_delay: 3
        min_in: 1
        timeout: 600
    - interactive:
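
    A slightly richer sketch (values are illustrative, options taken from the list above) that also grows pg numbers, sends periodic SIGHUPs to OSDs, and drives I/O with the rados task while the OSDs are being thrashed:

    tasks:
    - ceph:
    - thrashosds:
        chance_down: 0.5
        chance_pgnum_grow: 1
        chance_pgpnum_fix: 1
        sighup_delay: 0.1
        min_in: 4
        timeout: 1200
    - rados:
        clients: [client.0]
        ops: 4000
        objects: 50
        op_weights:
          read: 100
          write: 100
          delete: 50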

mon_clock_skew_check

Check if there are any clock skews among the monitors in the quorum.

This task accepts the following options:
    interval     amount of seconds to wait before check. (default: 30.0)
    expect-skew  'true' or 'false', to indicate whether to expect a skew during
                 the run or not. If 'true', the test will fail if no skew is
                 found, and succeed if a skew is indeed found; if 'false', it's
                 the other way around. (default: false)
    For example::
        - mon_clock_skew_check:
            expect-skew: true

Reprinted from blog.csdn.net/csnd_pan/article/details/81183698