Which Munin Graphs to Watch to Understand Server Load

This post looks at Munin graphs.

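If you are an infrastructure engineer, you probably use a tool such as Cacti, Munin, or Zabbix to graph metrics like load average so that you can keep track of the state of your servers.

This post considers which graphs are worth watching. I normally use Munin, so the discussion is specific to Munin. That said, the way of thinking about graphs applies to any of these tools.

I think there are two purposes for graphing:

To grasp the situation quickly when trouble occurs
To track how resource consumption grows over the medium to long term, as baseline data for performance tuning and capacity expansion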

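Graph use case 1 (grasping the situation quickly when trouble occurs)

When a server problem occurs, something changed somewhere, and the problem is the result of that change. Hardware failure aside, there are many possible causes: a sudden spike in traffic, increased load after a program change, load from an attack, a time-triggered pitfall that suddenly adds load, and so on.

Let's consider which graphs to set up so that changes in server load from causes like these can be seen.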

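Graph use case 2 (tracking medium- to long-term growth in resource consumption)

Here, by contrast, graphs are used to look at long-term trends in resource consumption.

Growth in CPU usage (it has doubled in six months, so the servers will need to be scaled up soon)
Growth in disk usage (it has grown by 10% in three months, so more disk will be needed)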
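
As a rough illustration of how the disk figure above can feed a capacity decision, here is a minimal sketch that extrapolates linearly. The function name months_until_full and the 60% starting point are assumptions made for this example, and "10%" is read as 10 percentage points. Real growth is rarely linear, so treat the result as a ballpark estimate only.

```shell
#!/bin/sh
# months_until_full CURRENT_PCT GROWTH_POINTS PERIOD_MONTHS
# Hypothetical helper: linear extrapolation of disk usage.
# Remaining headroom (100 - current) divided by growth per month.
months_until_full() {
    awk -v cur="$1" -v growth="$2" -v period="$3" 'BEGIN {
        if (growth <= 0) { print "never"; exit }
        printf "%.1f\n", (100 - cur) * period / growth
    }'
}

# Example (60% is an assumed current usage, not a measured one):
#   months_until_full 60 10 3
```

With those assumed inputs the estimate comes out at 12 months, which is the kind of number you would read off the yearly graph by eye.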

Munin graphs enabled by default on Ubuntu 16.04 LTS

A fair number of graphs are enabled by default.

The defaults are as follows (I believe installing the munin-node package sets these up from the start).

# ls -l /etc/munin/plugins
total 0
lrwxrwxrwx 1 root root 28 Oct 29 12:32 cpu -> /usr/share/munin/plugins/cpu
lrwxrwxrwx 1 root root 27 Oct 29 12:32 df -> /usr/share/munin/plugins/df
lrwxrwxrwx 1 root root 33 Oct 29 12:32 df_inode -> /usr/share/munin/plugins/df_inode
lrwxrwxrwx 1 root root 34 Oct 29 12:32 diskstats -> /usr/share/munin/plugins/diskstats
lrwxrwxrwx 1 root root 32 Oct 29 12:32 entropy -> /usr/share/munin/plugins/entropy
lrwxrwxrwx 1 root root 30 Oct 29 12:32 forks -> /usr/share/munin/plugins/forks
lrwxrwxrwx 1 root root 43 Oct 29 12:32 fw_forwarded_local -> /usr/share/munin/plugins/fw_forwarded_local
lrwxrwxrwx 1 root root 35 Oct 29 12:32 fw_packets -> /usr/share/munin/plugins/fw_packets
lrwxrwxrwx 1 root root 28 Oct 29 12:32 if_enp0s3 -> /usr/share/munin/plugins/if_
lrwxrwxrwx 1 root root 32 Oct 29 12:32 if_err_enp0s3 -> /usr/share/munin/plugins/if_err_
lrwxrwxrwx 1 root root 35 Oct 29 12:32 interrupts -> /usr/share/munin/plugins/interrupts
lrwxrwxrwx 1 root root 33 Oct 29 12:32 irqstats -> /usr/share/munin/plugins/irqstats
lrwxrwxrwx 1 root root 29 Oct 29 12:32 load -> /usr/share/munin/plugins/load
lrwxrwxrwx 1 root root 31 Oct 29 12:32 memory -> /usr/share/munin/plugins/memory
lrwxrwxrwx 1 root root 32 Oct 29 12:32 netstat -> /usr/share/munin/plugins/netstat
lrwxrwxrwx 1 root root 35 Oct 29 12:32 open_files -> /usr/share/munin/plugins/open_files
lrwxrwxrwx 1 root root 36 Oct 29 12:32 open_inodes -> /usr/share/munin/plugins/open_inodes
lrwxrwxrwx 1 root root 33 Oct 29 12:32 proc_pri -> /usr/share/munin/plugins/proc_pri
lrwxrwxrwx 1 root root 34 Oct 29 12:32 processes -> /usr/share/munin/plugins/processes
lrwxrwxrwx 1 root root 29 Oct 29 12:32 swap -> /usr/share/munin/plugins/swap
lrwxrwxrwx 1 root root 32 Oct 29 12:32 threads -> /usr/share/munin/plugins/threads
lrwxrwxrwx 1 root root 31 Oct 29 12:32 uptime -> /usr/share/munin/plugins/uptime
lrwxrwxrwx 1 root root 30 Oct 29 12:32 users -> /usr/share/munin/plugins/users
lrwxrwxrwx 1 root root 31 Oct 29 12:32 vmstat -> /usr/share/munin/plugins/vmstat

Graphs worth adding

In my case, I add the following graphs on top of these.

fw_conntrack
netstat_multi
tcp
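
Judging from the directory listings in this post, a plugin is enabled by a symlink from /etc/munin/plugins to /usr/share/munin/plugins. The sketch below follows that layout. The helper name enable_plugins is made up for this example, and the restart command in the comment assumes the stock Ubuntu service name.

```shell
#!/bin/sh
# enable_plugins SRC_DIR DST_DIR NAME...
# Hypothetical helper: symlink each named plugin from SRC_DIR into DST_DIR,
# skipping names that do not exist in SRC_DIR.
enable_plugins() {
    src=$1; dst=$2
    shift 2
    for name in "$@"; do
        if [ ! -e "$src/$name" ]; then
            echo "skip: $src/$name not found" >&2
            continue
        fi
        ln -sf "$src/$name" "$dst/$name"
    done
}

# Example (run as root; the service name is an assumption):
#   enable_plugins /usr/share/munin/plugins /etc/munin/plugins \
#       fw_conntrack netstat_multi tcp
#   service munin-node restart
```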

In addition, it is a good idea to add graphs for whatever middleware you use. For example:

nginx_status
nginx_request
apache_accesses
apache_activity (https://github.com/munin-monitoring/contrib/blob/master/plugins/apache/apache_activity)
mysql_commands
mysql_connections
mysql_files_tables
mysql_innodb_bpool
mysql_innodb_bpool_act
mysql_innodb_insert_buf
mysql_innodb_io
mysql_innodb_io_pend
mysql_innodb_log
mysql_innodb_rows
mysql_innodb_semaphores
mysql_innodb_tnx
mysql_network_traffic
mysql_qcache
mysql_qcache_mem
mysql_select_types
mysql_slow
mysql_sorts
mysql_table_locks
mysql_tmp_tables

If you also use Redis, Memcached, and so on, add the corresponding graphs as well.

The configuration on my VPS is shown below. It runs nginx, MySQL, and my own application server.

Because traffic is not that high, I have not added the nginx graphs yet.

# ls -l /etc/munin/plugins/
total 0
lrwxrwxrwx 1 root root 28 Oct 28 21:28 cpu -> /usr/share/munin/plugins/cpu
lrwxrwxrwx 1 root root 27 Oct 28 21:28 df -> /usr/share/munin/plugins/df
lrwxrwxrwx 1 root root 33 Oct 28 21:28 df_inode -> /usr/share/munin/plugins/df_inode
lrwxrwxrwx 1 root root 34 Oct 28 21:28 diskstats -> /usr/share/munin/plugins/diskstats
lrwxrwxrwx 1 root root 32 Oct 28 21:28 entropy -> /usr/share/munin/plugins/entropy
lrwxrwxrwx 1 root root 30 Oct 28 21:28 forks -> /usr/share/munin/plugins/forks
lrwxrwxrwx 1 root root 37 Oct 28 23:29 fw_conntrack -> /usr/share/munin/plugins/fw_conntrack
lrwxrwxrwx 1 root root 35 Oct 28 21:28 fw_packets -> /usr/share/munin/plugins/fw_packets
lrwxrwxrwx 1 root root 28 Oct 28 21:28 if_ens3 -> /usr/share/munin/plugins/if_
lrwxrwxrwx 1 root root 28 Oct 28 21:28 if_ens4 -> /usr/share/munin/plugins/if_
lrwxrwxrwx 1 root root 28 Oct 28 21:28 if_ens5 -> /usr/share/munin/plugins/if_
lrwxrwxrwx 1 root root 32 Oct 28 21:28 if_err_ens3 -> /usr/share/munin/plugins/if_err_
lrwxrwxrwx 1 root root 32 Oct 28 21:28 if_err_ens4 -> /usr/share/munin/plugins/if_err_
lrwxrwxrwx 1 root root 32 Oct 28 21:28 if_err_ens5 -> /usr/share/munin/plugins/if_err_
lrwxrwxrwx 1 root root 35 Oct 28 21:28 interrupts -> /usr/share/munin/plugins/interrupts
lrwxrwxrwx 1 root root 33 Oct 28 21:28 irqstats -> /usr/share/munin/plugins/irqstats
lrwxrwxrwx 1 root root 29 Oct 28 21:28 load -> /usr/share/munin/plugins/load
lrwxrwxrwx 1 root root 31 Oct 28 21:28 memory -> /usr/share/munin/plugins/memory
lrwxrwxrwx 1 root root 31 Oct 28 22:46 mysql_commands -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 23:44 mysql_connections -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:48 mysql_files_tables -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:48 mysql_innodb_bpool -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:48 mysql_innodb_bpool_act -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:48 mysql_innodb_insert_buf -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:48 mysql_innodb_io -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:48 mysql_innodb_io_pend -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:48 mysql_innodb_log -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:48 mysql_innodb_rows -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:48 mysql_innodb_semaphores -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:48 mysql_innodb_tnx -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:48 mysql_network_traffic -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:48 mysql_qcache -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:49 mysql_qcache_mem -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:49 mysql_select_types -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:49 mysql_slow -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:49 mysql_sorts -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:49 mysql_table_locks -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 31 Oct 28 22:49 mysql_tmp_tables -> /usr/share/munin/plugins/mysql_
lrwxrwxrwx 1 root root 32 Oct 28 21:28 netstat -> /usr/share/munin/plugins/netstat
lrwxrwxrwx 1 root root 38 Oct 28 22:38 netstat_multi -> /usr/share/munin/plugins/netstat_multi
lrwxrwxrwx 1 root root 35 Oct 28 21:28 open_files -> /usr/share/munin/plugins/open_files
lrwxrwxrwx 1 root root 36 Oct 28 21:28 open_inodes -> /usr/share/munin/plugins/open_inodes
lrwxrwxrwx 1 root root 33 Oct 28 21:28 proc_pri -> /usr/share/munin/plugins/proc_pri
lrwxrwxrwx 1 root root 34 Oct 28 21:28 processes -> /usr/share/munin/plugins/processes
lrwxrwxrwx 1 root root 29 Oct 28 21:28 swap -> /usr/share/munin/plugins/swap
lrwxrwxrwx 1 root root 28 Oct 28 22:38 tcp -> /usr/share/munin/plugins/tcp
lrwxrwxrwx 1 root root 32 Oct 28 21:28 threads -> /usr/share/munin/plugins/threads
lrwxrwxrwx 1 root root 31 Oct 28 21:28 uptime -> /usr/share/munin/plugins/uptime
lrwxrwxrwx 1 root root 30 Oct 28 21:28 users -> /usr/share/munin/plugins/users
lrwxrwxrwx 1 root root 31 Oct 28 21:28 vmstat -> /usr/share/munin/plugins/vmstat
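
In the listing above, every mysql_* entry points at the same file, mysql_, and only the link name differs. A minimal sketch of creating links in that pattern follows. The helper name link_wildcard is an assumption for this example, and only three of the suffixes are shown.

```shell
#!/bin/sh
# link_wildcard SRC_DIR DST_DIR PREFIX SUFFIX...
# Hypothetical helper: for a wildcard plugin such as mysql_, every link
# points at the same file and the link name selects the graph.
link_wildcard() {
    src=$1; dst=$2; prefix=$3
    shift 3
    for suffix in "$@"; do
        ln -sf "$src/$prefix" "$dst/$prefix$suffix"
    done
}

# Example (run as root):
#   link_wildcard /usr/share/munin/plugins /etc/munin/plugins mysql_ \
#       commands connections slow
```

The same pattern matches the if_ and if_err_ entries, where the suffix is the interface name.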

Examples by graph

CPU

[Image: Munin CPU usage graph]

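This one probably needs little explanation.

If user is rising, programs are using the CPU.
If system is rising, kernel-space work such as network communication and memory allocation is using the CPU.
If iowait is rising, disk IO is occurring and the CPU is waiting on it.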
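
When you are logged in to the server, the same three values can be checked against the kernel's own counters. The sketch below reads the aggregate cpu line of /proc/stat. The function name cpu_breakdown is made up for this example. The counters are cumulative since boot, so this gives an average since boot rather than the moment-to-moment picture a graph shows.

```shell
#!/bin/sh
# cpu_breakdown: reads /proc/stat-style text on stdin and prints the share
# of user, system and iowait time in the aggregate "cpu" line.
# Field order after the "cpu" label:
#   user nice system idle iowait irq softirq steal
cpu_breakdown() {
    awk '$1 == "cpu" {
        total = 0
        for (i = 2; i <= 9 && i <= NF; i++) total += $i
        if (total == 0) exit 1
        printf "user=%.1f%% system=%.1f%% iowait=%.1f%%\n",
            100 * $2 / total, 100 * $4 / total, 100 * $6 / total
    }'
}

# Usage: cpu_breakdown < /proc/stat
```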

Memory

[Image: Munin memory usage graph]

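apps is the amount of real memory used by applications. This is the most important value.
cache is the page cache. To find out how effectively it is actually being used, you need to run echo 3 > /proc/sys/vm/drop_caches and see what happens to disk IO.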
For active and inactive, the graph does not show whether the memory belongs to cache or to apps, which is arguably something that could be improved.
If the amount of swap is going up and down, be careful.
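
For a quick check on the server itself, the page cache size and the amount of swap in use can be read from /proc/meminfo. This is only a sketch: the function name meminfo_summary is an assumption, and the values correspond only roughly to the cache and swap areas of the graph.

```shell
#!/bin/sh
# meminfo_summary: reads /proc/meminfo-style text on stdin and prints the
# page cache size and the swap in use, both in kB.
meminfo_summary() {
    awk '
        $1 == "Cached:"    { cached = $2 }
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END { printf "cache_kb=%d swap_used_kb=%d\n", cached, total - free }
    '
}

# Usage: meminfo_summary < /proc/meminfo
```

Running it a few times in a row is a cheap way to see whether swap usage is actually moving.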

Load average

[Image: Munin load average graph]

This too probably needs no explanation. When it rises suddenly, the cause is in most cases one of the following: a traffic increase, a runaway process, or disk IO.

Disk IO

[Images: three Munin graphs related to disk IO]

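Especially when using HDDs, disk IO is often the bottleneck. These graphs are worth checking when, for example, load average suddenly rises. In practice, though, I more often notice high iowait by logging in to the server and running vmstat or iostat...

As for swap, the disk IO caused by swap-in and swap-out can make load average explode, so it is worth watching. In most cases, though, you see swap growing in the memory graph and think "whoa, we're swapping"...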

Process count and context switches

[Images: Munin process count and context switch graphs]

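These numbers are also things to check when load spikes.

An increase in process count may mean that some processing is stuck somewhere.
An increase in context switches is what you see when a large number of processes are doing work at the same time.

However, with these graphs too, I more often notice the problem with vmstat and ps after logging in to the server once load has spiked.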

TCP sockets

[Image: Munin TCP socket state graph]

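I mostly check this not when load rises but when HTTP error counts go up or server response gets worse. When application servers connect to middleware such as Memcached, Redis, and MySQL and web server response deteriorates, one place gets stuck first and connections then pile up elsewhere, so a server that is not the cause can also show a large number of established connections.

Particularly important is the case where the server itself opens a large number of outbound TCP connections (to Memcached, MySQL, and so on). Sockets in the TIME_WAIT state accumulate, and local TCP ports can be exhausted. The first step is to widen the local port range (net.ipv4.ip_local_port_range). I would like to cover this in more detail if I get the chance.

Even a server that only accepts connections has a maximum number of TIME_WAIT sockets, so it is advisable to raise it (net.ipv4.tcp_max_tw_buckets).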
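
To see how close a server is to these limits, it helps to count sockets per state and compare the TIME-WAIT count with the size of the local port range. The sketch below does both. The function names are made up for this example, and the parsing assumes the usual ss -tan layout with the state in the first column.

```shell
#!/bin/sh
# count_tcp_states: reads `ss -tan` output on stdin and prints the number
# of sockets per TCP state, sorted by state name.
count_tcp_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# local_port_count: reads "LOW HIGH" (the format of
# /proc/sys/net/ipv4/ip_local_port_range) and prints how many ports it spans.
local_port_count() {
    awk '{ print $2 - $1 + 1 }'
}

# Usage:
#   ss -tan | count_tcp_states
#   local_port_count < /proc/sys/net/ipv4/ip_local_port_range
#   sysctl net.ipv4.ip_local_port_range net.ipv4.tcp_max_tw_buckets
```

If the TIME-WAIT count toward a single destination approaches the port count, that is the point at which widening the range starts to matter.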

netfilter conntrack table

[Image: Munin netfilter conntrack table graph]

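When you use the iptables state module with a rule such as iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT, the kernel keeps a table that remembers which packet exchanges have taken place recently, and this graph shows how much of that table is in use. If the table holds a large number of entries, it may be worth raising the maximum (net.netfilter.nf_conntrack_max) or shortening the timeout (net.netfilter.nf_conntrack_generic_timeout). This is the familiar situation where the server becomes hard to reach and dmesg or syslog shows the message ip_conntrack: table full, dropping packet (on newer kernels the prefix is nf_conntrack).

Up to Ubuntu 14.04 there was a separate timeout for each item, such as net.netfilter.nf_conntrack_tcp_timeout_time_wait, but with the 16.04 kernel I can only find net.netfilter.nf_conntrack_generic_timeout. This needs further investigation.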
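
To judge whether the table maximum mentioned above needs raising, the current entry count can be compared with it directly. This is a minimal sketch: the function name conntrack_usage is an assumption, and the /proc paths in the usage comment only exist once the conntrack module is loaded.

```shell
#!/bin/sh
# conntrack_usage COUNT MAX
# Hypothetical helper: prints table usage as a percentage.
conntrack_usage() {
    awk -v c="$1" -v m="$2" 'BEGIN {
        if (m <= 0) exit 1
        printf "%.1f\n", 100 * c / m
    }'
}

# Usage (paths exist only when nf_conntrack is loaded):
#   conntrack_usage "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#                   "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```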

Disk usage

[Images: Munin disk usage and inode usage graphs]

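Note that "usage" comes in two forms:

Disk space
inode usage

Disk space is monitored in most cases, but inode monitoring is sometimes forgotten, and it bites you. It is the "Huh, writes are failing. But the disk still has free space. Oh, we're out of inodes" kind of problem. Debug log files and the like can pile up in huge numbers and exhaust inodes while disk space is still plentiful, so you cannot let your guard down.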
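
A quick way to catch this by hand is to filter the output of df -i for filesystems above some threshold. The sketch below does that. The function name inodes_over and the 80% threshold are assumptions for this example, and the parsing breaks on mount points that contain spaces.

```shell
#!/bin/sh
# inodes_over LIMIT_PCT
# Hypothetical helper: reads `df -i` output on stdin and prints the mount
# point and inode usage of every filesystem at or above LIMIT_PCT.
inodes_over() {
    awk -v limit="$1" 'NR > 1 && $5 != "-" {
        pct = $5
        sub(/%$/, "", pct)
        if (pct + 0 >= limit) print $6, pct "%"
    }'
}

# Usage: df -i | inodes_over 80
```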