Automatic server backups and disk health checks

Original
02/02 15:50

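My server broke down recently. Although I had set up RAID with btrfs, both disks failed at the same time, which was just my luck. To avoid a repeat, I decided to add automatic backups and disk health checks to the server: if a disk's health falls short of expectations, the script sends an email alert. That makes things at least a little safer. I wrote a script to handle this automatically and am recording it here.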

 

backup.py

#!/usr/bin/env python3



## Backup
import os,sys
import datetime

now = datetime.datetime.now()
copies = 3     # number of backup copies to keep
#datetime_format = '%Y-%m-%d_%H_%M'      # for testing
datetime_format = '%Y-%m-%d'
backup_folder = '/home/xxxx/Backup/'
source_folder = '/home/xxxx/test/*'

# Delete the old backup folder (the one dated copies*7 days ago)
old_folder = (now - datetime.timedelta(days=copies*7)).strftime(datetime_format)
if (os.path.exists(backup_folder + old_folder)):
    os.system('rm -Rf ' + backup_folder + old_folder)

# Create the new backup in the target folder
new_folder = now.strftime(datetime_format)
if (not os.path.exists(backup_folder + new_folder)):
    os.system('mkdir ' + backup_folder + new_folder)
# Plain copy without compression
#os.system('cp -R ' + source_folder + ' ' + backup_folder + new_folder)
# Compressed backup
status = os.system('tar zcf ' + backup_folder + new_folder + '/test.tar.gz ' + source_folder)

# Only report success if tar exited cleanly
if (status == 0):
    print(datetime.datetime.now(), "backup completed successfully!")
else:
    print(datetime.datetime.now(), "backup failed! exit status: ", status)



## Check disk health
import subprocess
import re

command1 = 'echo %s | sudo -S smartctl -l ssd /dev/sdb' % "your sudo password"
command2 = 'echo %s | sudo -S smartctl -l ssd /dev/sdc' % "your sudo password"
command3 = 'echo %s | sudo -S btrfs device stats /mnt' % "your sudo password"
command4 = 'echo %s | sudo -S btrfs scrub start -Bd /mnt' % "your sudo password"

health1 = 0
health2 = 0
errors = 0
flag = False
try:
    result1 = subprocess.run(command1, shell=True, capture_output=True, text=True)
    result2 = subprocess.run(command2, shell=True, capture_output=True, text=True)
    if (result1.returncode == 0):
        output1 = result1.stdout.splitlines()
        for line in output1:
            # Some NVMe drives do not report this item and need a different approach. This is only an example.
            if ('Percentage Used Endurance Indicator' in line):
                info = re.split(r'\s+', line)
                health1 = 100 - int(info[3])
                print("disk1 health: ",  health1, "%")
                break
    else:
        print("[disk1] smartctl can not run! error code: ", result1.returncode)
    if (result2.returncode == 0):
        output2 = result2.stdout.splitlines()
        for line in output2:
            # Some NVMe drives do not report this item and need a different approach. This is only an example.
            if ('Percentage Used Endurance Indicator' in line):
                info = re.split(r'\s+', line)
                health2 = 100 - int(info[3])
                print("disk2 health: ", health2, "%")
                break
    else:
        print("[disk2] smartctl can not run! error code: ", result2.returncode)

    result3 = subprocess.run(command3, shell=True, capture_output=True, text=True)
    if (result3.returncode == 0):
        output3 = result3.stdout.splitlines()
        for line in output3:
            info = re.split(r'\s+', line)
            errors += int(info[1])
        print("total errors: ", errors)
    else:
        print("btrfs device stats can not run! error code: ", result3.returncode)

    result4 = subprocess.run(command4, shell=True, capture_output=True, text=True)
    if (result4.returncode == 0):
        output4 = result4.stdout.splitlines()
        for line in output4:
            # With -d, scrub prints one 'Error summary' per device, so check every one
            if ('Error summary' in line):
                info = re.split(r':\s+', line)
                if (info[1].strip() != 'no errors found'):
                    flag = True
                print(line)
    else:
        print("btrfs scrub can not run! error code: ", result4.returncode)
except FileNotFoundError as e:
    print("smartctl is not installed!", str(e))

#sys.exit()        # for testing



## Send email
if (health1 < 100 or health2 < 100 or errors > 0 or flag):
    import smtplib

    smtp_server = 'smtp.xxx.com'
    smtp_port = 465
    mail_sender = 'abc@xxx.com'
    mail_sender_password = "abc's password"

    server = smtplib.SMTP_SSL(smtp_server, smtp_port)
    server.login(mail_sender, mail_sender_password)

    receiver_email = 'xyz@yyy.com'   # may be the same as abc@xxx.com, but test that yourself
    message = "\n".join([
        "Subject: [Warning " + str(now) + "] The disks on server are broken!",
        "To: {}".format(receiver_email),
        "From: {}".format(mail_sender),
        "",
        "The disks on your server are unhealthy, please check and replace them as soon as possible!"
        ])

    server.sendmail(mail_sender, receiver_email, message)
    server.quit()
    print(datetime.datetime.now(), " Email sent successfully.")

print("duration: ", datetime.datetime.now() - now, "\n")
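The email section of the script assembles the headers by hand. As an alternative, the standard library's `email.message.EmailMessage` can take care of header formatting. The following is only a sketch: `build_alert` is a hypothetical helper name that is not part of the original script, and the addresses are the same placeholders used above.

```python
from email.message import EmailMessage


def build_alert(sender, receiver, health1, health2, errors, scrub_failed):
    """Build the warning mail; the arguments mirror the script's variables."""
    msg = EmailMessage()
    msg["Subject"] = "[Warning] Disk health check failed on server"
    msg["From"] = sender
    msg["To"] = receiver
    # Put the measured values in the body so the mail says what went wrong
    msg.set_content(
        "disk1 health: {}%\n"
        "disk2 health: {}%\n"
        "btrfs device errors: {}\n"
        "scrub reported errors: {}\n".format(health1, health2, errors, scrub_failed)
    )
    return msg
```

With an open `smtplib.SMTP_SSL` connection such as `server` in the script, the message could then be sent with `server.send_message(build_alert(...))` instead of `server.sendmail(...)`.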

Then create the following scheduled job in crontab so that the backup runs every Saturday at 1 a.m.

# m h  dom mon dow   command
0  1   *   *  Sat   /usr/bin/python3 /home/xxx/bin/backup.py >> /home/xxx/log/backup.log 2>&1
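One thing to keep in mind: the cleanup step in backup.py only deletes the folder dated exactly `copies*7` days ago, so it depends on this weekly schedule and leaves folders behind if a run is skipped or the schedule changes. A possible alternative, sketched below, is to prune by count instead. `prune_backups` is a hypothetical helper, and it assumes the backup folder contains sub-folders named in the `%Y-%m-%d` format used by the script.

```python
import os
import re
import shutil

# Matches the '%Y-%m-%d' folder names created by the backup script
DATE_DIR = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def prune_backups(backup_folder, copies):
    """Delete all but the newest `copies` dated sub-folders; return the removed names."""
    # ISO-style dates sort chronologically when sorted as plain strings
    dated = sorted(
        name for name in os.listdir(backup_folder)
        if DATE_DIR.match(name) and os.path.isdir(os.path.join(backup_folder, name))
    )
    stale = dated[:-copies] if copies > 0 else dated
    for name in stale:
        shutil.rmtree(os.path.join(backup_folder, name))
    return stale
```

Calling `prune_backups(backup_folder, copies)` after a successful backup would keep the newest `copies` folders regardless of how often the job runs. Folders whose names are not dates are left alone.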

 

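Note that the backup still lives on the same server, only on a different disk, which is not safe either. If two disks can fail together, nothing rules out every disk in the server dying from some common cause. In principle the backup should go to a remote machine via rsync, but the files are fairly large and I worry that limited bandwidth would make the backup take too long. I will leave it like this for now and set up a separate backup server later.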
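For when that backup server exists, here is a rough sketch of what the remote step might look like. The remote target is a placeholder, and `rsync_command` and `push_backup` are hypothetical helper names. The flags are standard rsync options: `-a` (archive mode), `-z` (compress during transfer), `--partial` (keep partially transferred files so an interrupted run can resume) and `--bwlimit` (cap the transfer rate, which may help with the bandwidth concern).

```python
import subprocess


def rsync_command(source, remote, bwlimit_kib=0):
    """Build an rsync argument list; `remote` looks like 'user@host:/path/'."""
    cmd = ["rsync", "-az", "--partial"]
    if bwlimit_kib > 0:
        # Limit the transfer rate so the backup does not saturate the link
        cmd.append("--bwlimit={}".format(bwlimit_kib))
    cmd += [source, remote]
    return cmd


def push_backup(source, remote, bwlimit_kib=0):
    """Run rsync and return its exit code (0 means success)."""
    return subprocess.run(rsync_command(source, remote, bwlimit_kib)).returncode
```

Because rsync skips files that are already present and unchanged on the destination, the older weekly archives would not be sent again on each run.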

--------------------------------------------------------------

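Next, a short introduction to smartctl. This tool reads a disk's SMART information, which can be used to check disk health. SMART data differs between vendors, though, and on SSDs from small domestic off-brand manufacturers in particular it is often either incomplete or purely decorative. So study the SMART output of your own drives first, then adapt the script above based on what you find.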
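Since not every drive reports the field the script looks for, it may be worth making the parsing tolerant of a missing or non-numeric value. The sketch below assumes the column layout that the script's `info[3]` indexing already relies on (page, offset, size, value, flags, description). `percentage_used` is a hypothetical helper name.

```python
def percentage_used(smartctl_output):
    """Return the 'Percentage Used Endurance Indicator' value, or None if unavailable."""
    for line in smartctl_output.splitlines():
        if "Percentage Used Endurance Indicator" in line:
            # Assumed columns: Page, Offset, Size, Value, Flags, Description...
            fields = line.split()
            try:
                return int(fields[3])
            except (IndexError, ValueError):
                # The drive lists the field but gives no usable number
                return None
    # The drive does not report this field at all
    return None
```

Returning `None` lets the caller tell "no data" apart from "0% used", so a drive without this field can be handled by some other check instead of being treated as fully worn.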

 

To use smartctl, first install smartmontools:

sudo apt install smartmontools

The following commands start a disk self-test:

sudo smartctl -t long /dev/sdb     # long test, takes several hours
sudo smartctl -t short /dev/sdb    # short test, takes several minutes

The three most common ways to view the information are:

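smartctl -a /dev/sdb            # show all information
smartctl -l devstat /dev/sdb    # show device statistics, which may include SSD lifetime, though it may be inaccurate
smartctl -H /dev/sdb            # show the health status, though it may be inaccurate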
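The verdict from `smartctl -H` can also be read by a script. The sketch below assumes the two verdict lines smartctl commonly prints: `SMART overall-health self-assessment test result: PASSED` for ATA and NVMe drives, and `SMART Health Status: OK` for SCSI/SAS drives. `overall_health_ok` is a hypothetical helper name, and as noted above the verdict itself may not be reliable on every drive.

```python
def overall_health_ok(smartctl_h_output):
    """Return True or False for the health verdict, or None if no verdict line is found."""
    for line in smartctl_h_output.splitlines():
        if "overall-health self-assessment test result" in line:
            return line.split(":", 1)[1].strip() == "PASSED"
        if "SMART Health Status" in line:
            return line.split(":", 1)[1].strip() == "OK"
    return None
```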

 

--------------------------------------------------------------------------

While I am at it, here are some btrfs repair commands for future reference. The following is taken from:

https://zhuanlan.zhihu.com/p/620733061?utm_id=0

 

The basic sequence is to run scrub first to check the disk for errors:

mount /dev/sda1 /mnt       # mount the disk
btrfs scrub start /mnt     # start the check
btrfs scrub status /mnt    # show progress
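To evaluate the scrub result from a script, the "Error summary" lines can be collected and compared. The sketch below assumes the `Error summary:    no errors found` format that backup.py already relies on, and it looks at every summary line because per-device output contains one per device. `scrub_found_errors` is a hypothetical helper name.

```python
def scrub_found_errors(scrub_output):
    """Return True if any 'Error summary' line reports something other than 'no errors found'."""
    summaries = [
        line.split(":", 1)[1].strip()
        for line in scrub_output.splitlines()
        if line.strip().startswith("Error summary")
    ]
    if not summaries:
        # No summary at all usually means the scrub did not run or has not finished
        raise ValueError("no 'Error summary' line in scrub output")
    return any(summary != "no errors found" for summary in summaries)
```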

scrub needs a mounted filesystem, so if the normal mount fails, try mounting with the following options:

mount -o degraded,usebackuproot /dev/sda1 /mnt

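Here degraded allows a RAID1 filesystem to be mounted with only one of its disks present; without it the mount fails because a device is missing. usebackuproot makes the mount fall back to an older backup tree root if the current one is damaged.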

 

Next, try the check and rescue commands to inspect and repair the filesystem:

btrfs check /dev/sda1    # check the disk (the filesystem must be unmounted)
btrfs rescue super-recover /dev/sda1
btrfs rescue zero-log /dev/sda1
btrfs rescue chunk-recover /dev/sda1

 

If all else fails, back up the data and then repair:

btrfs restore /dev/sda1 /mnt/backup    # back up first
btrfs check --repair /dev/sda1         # then repair

 

--------------------------------------------------------

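Finally, here is how to have a virtual machine start automatically at boot. It is based on the link below. The first method described there did not work for me, for reasons I could not determine, so I only use the second one.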

https://www.jianshu.com/p/0c225e914897

 

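Create the service file /etc/systemd/system/vname.service. Here vname stands for the virtual machine name; you can also choose your own name, as long as it is easy to understand.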

[Unit]
Description=vname
After=network.target virtualbox.service
Before=runlevel2.target shutdown.target
[Service]
User=user
Group=vboxusers
Type=forking
Restart=no
TimeoutSec=5min
IgnoreSIGPIPE=no
KillMode=process
GuessMainPID=no
RemainAfterExit=yes
ExecStart=/usr/bin/VBoxManage startvm vname --type headless
ExecStop=/usr/bin/VBoxManage controlvm vname acpipowerbutton
[Install]
WantedBy=multi-user.target

Replace vname and user in the unit file above with the appropriate virtual machine name and user name.

 

Reload Daemon

sudo systemctl daemon-reload

Enable start at boot

sudo systemctl enable vname

Disable start at boot

sudo systemctl disable vname

Other commands

sudo systemctl start vname      # start the service
sudo systemctl stop vname       # stop the service
sudo systemctl status vname     # show the service status

In the commands above, vname is the vname part of the vname.service file name.
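If several virtual machines need the same treatment, the unit file can be generated instead of edited by hand. The sketch below fills the same unit text with `string.Template`. `render_unit` is a hypothetical helper, and the VM and user names in the usage note are made-up examples.

```python
from string import Template

# Same unit file as above, with vname and user turned into placeholders
UNIT_TEMPLATE = Template("""\
[Unit]
Description=$vname
After=network.target virtualbox.service
Before=runlevel2.target shutdown.target
[Service]
User=$user
Group=vboxusers
Type=forking
Restart=no
TimeoutSec=5min
IgnoreSIGPIPE=no
KillMode=process
GuessMainPID=no
RemainAfterExit=yes
ExecStart=/usr/bin/VBoxManage startvm $vname --type headless
ExecStop=/usr/bin/VBoxManage controlvm $vname acpipowerbutton
[Install]
WantedBy=multi-user.target
""")


def render_unit(vname, user):
    """Return the unit file text for one virtual machine."""
    return UNIT_TEMPLATE.substitute(vname=vname, user=user)
```

For example, the text returned by `render_unit("devbox", "alice")` would be written to /etc/systemd/system/devbox.service, followed by the daemon-reload and enable commands shown above.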
