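Linux System Diagnostics: Copying System Logs
After releasing software, bugs are inevitable. Adding runtime logging to the software is an effective first step: analyzing the log output resolves the vast majority of problems.
Sometimes, however, a crash or freeze is not caused by the software alone and may be related to the system. Exported logs therefore need to include the relevant system logs as well as the software's own logs.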
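Contents
- 1. About System Logs
- 1.1 CPU Temperature Monitoring
- 1.2 Memory Monitoring
- 1.3 Network Device Status Monitoring
- 1.4 Interrupt Information
- 1.5 Kernel Logs
- 1.6 Important Event Records
- 1.7 Querying Information About a Specific Running Process
- 1.8 System Logs
- 1.9 Graphics Display Status Monitoring
- 2. Examples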
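1. About System Logs
1.1 CPU Temperature Monitoring
Guards against system crashes caused by an overheating CPU.
- temperature_log.csv
Recorded from the output of the sensors command. Sample contents: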
2025-07-22 15:36:15,+78.0°C,+78.0°C,+71.0°C,+71.0°C,
2025-07-22 15:38:15,+79.0°C,+79.0°C,+71.0°C,+71.0°C,
2025-07-22 15:40:15,+80.0°C,+80.0°C,+71.0°C,+72.0°C,
2025-07-22 15:42:15,+79.0°C,+79.0°C,+71.0°C,+71.0°C,
2025-07-22 15:44:15,+80.0°C,+80.0°C,+71.0°C,+72.0°C,
2025-07-22 15:46:15,+79.0°C,+79.0°C,+71.0°C,+71.0°C,
2025-07-22 15:48:15,+79.0°C,+79.0°C,+71.0°C,+71.0°C,
2025-07-22 15:50:16,+79.0°C,+79.0°C,+71.0°C,+71.0°C,
2025-07-22 15:52:16,+79.0°C,+79.0°C,+71.0°C,+71.0°C,
2025-07-22 15:54:16,+79.0°C,+79.0°C,+71.0°C,+71.0°C,
2025-07-22 15:56:16,+81.0°C,+81.0°C,+71.0°C,+72.0°C,
2025-07-22 15:58:16,+80.0°C,+80.0°C,+71.0°C,+72.0°C,
2025-07-22 16:00:16,+80.0°C,+80.0°C,+71.0°C,+72.0°C,
2025-07-22 16:02:17,+79.0°C,+79.0°C,+71.0°C,+71.0°C,
2025-07-22 16:04:17,+79.0°C,+79.0°C,+71.0°C,+71.0°C,
2025-07-22 16:06:17,+79.0°C,+79.0°C,+71.0°C,+71.0°C,
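A minimal sketch of how one row of such a CSV could be produced from sensors output. The function name temps_to_csv_row is hypothetical, and the assumption that the logged columns are the "Package id" and "Core" readings is a guess; the original logging script is not shown.

```shell
# Hypothetical helper: turn `sensors` output (stdin) into one CSV row.
# $1 is the timestamp to put in the first column.
# Assumption: only "Package id N:" and "Core N:" lines are of interest.
temps_to_csv_row() {
    awk -v ts="$1" '
        /^(Package id|Core) / {
            split($0, parts, ":")        # text after the label
            split(parts[2], f, " ")      # first token is the reading
            row = row f[1] ","
        }
        END { print ts "," row }
    '
}

# Usage (appends a row; run it from cron or a loop, e.g. every 2 minutes):
#   sensors | temps_to_csv_row "$(date '+%Y-%m-%d %H:%M:%S')" >> temperature_log.csv
```

The trailing comma is kept so that rows match the sample above.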
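1.2 Memory Monitoring
Guards against program crashes caused by the system running out of space. Note that df reports filesystem (disk) usage rather than RAM; in the output below the root filesystem is 96% full.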
$ df
Filesystem 1K-blocks Used Available Use% Mounted on
tmpfs 808404 2184 806220 1% /run
/dev/sda3 81467832 73466136 3817428 96% /
tmpfs 4042016 0 4042016 0% /dev/shm
tmpfs 5120 4 5116 1% /run/lock
/dev/sda2 524252 6228 518024 2% /boot/efi
tmpfs 808400 128 808272 1% /run/user/1000
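The df output above can also be checked automatically. The sketch below lists mount points at or above a usage limit; the function name fs_over_threshold and the 90% limit in the usage line are illustrative assumptions.

```shell
# Hypothetical helper: read `df -P` output on stdin and print every
# mount point whose Use% is at or above the limit given as $1.
# Assumption: mount points contain no spaces.
fs_over_threshold() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header row
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit) print $6, use "%"
        }
    '
}

# Usage:
#   df -P | fs_over_threshold 90
```

With the sample above and a limit of 90, only the root filesystem would be reported.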
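1.3 Network Device Status Monitoring
Use dmesg to check the status of devices such as disks, memory, and network interfaces, or to read the relevant error messages when the system fails.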
$ dmesg
[ 0.000000] Linux version 6.8.0-79-generic (buildd@lcy02-amd64-071) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #79~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 15 16:54:53 UTC 2 (Ubuntu 6.8.0-79.79~22.04.1-generic 6.8.12)
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-79-generic root=UUID=5c5d3d7c-b15d-40b9-bd6d-dbccc23edca0 ro quiet splash
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Hygon HygonGenuine
[ 0.000000] Centaur CentaurHauls
[ 0.000000] zhaoxin Shanghai
[ 0.000000] Disabled fast string operations
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009e7ff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009e800-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000dc000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000bfecffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000bfed0000-0x00000000bfefefff] ACPI data
[ 0.000000] BIOS-e820: [mem 0x00000000bfeff000-0x00000000bfefffff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x00000000bff00000-0x00000000bfffffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000f0000000-0x00000000f7ffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fec00000-0x00000000fec0ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fee00000-0x00000000fee00fff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffe0000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000023fffffff] usable
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] APIC: Static calls initialized
[ 0.000000] SMBIOS 2.7 present.
[ 0.000000] DMI: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
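Because dmesg output is long, it usually helps to narrow it down before reading. The following is only a sketch: the function name and the keyword list are assumptions, not a complete list of kernel error messages.

```shell
# Hypothetical helper: keep only dmesg lines (stdin) that contain
# common problem keywords. The keyword list is an assumption.
filter_kernel_problems() {
    grep -iE 'error|fail|warn|out of memory|call trace'
}

# Usage:
#   dmesg | filter_kernel_problems
# util-linux dmesg can also filter by log level directly:
#   dmesg --level=err,warn
```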
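1.4 Interrupt Information
/proc/interrupts contains information about all interrupts in the system, including the number of interrupts handled by each CPU and the count for each interrupt type.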
$ cat /proc/interrupts
CPU0 CPU1 CPU2 CPU3
0: 27 0 0 0 IO-APIC 2-edge timer
1: 629 8 1663 1436 IO-APIC 1-edge i8042
8: 0 0 0 0 IO-APIC 8-edge rtc0
9: 0 0 0 0 IO-APIC 9-fasteoi acpi
12: 918 178 163451 15408 IO-APIC 12-edge i8042
14: 0 0 0 0 IO-APIC 14-edge ata_piix
15: 0 0 0 0 IO-APIC 15-edge ata_piix
16: 0 307590 0 768 IO-APIC 16-fasteoi vmwgfx, snd_ens1371, ens38
17: 180736 0 8738 0 IO-APIC 17-fasteoi ehci_hcd:usb2, ioc0
18: 91 51 0 0 IO-APIC 18-fasteoi uhci_hcd:usb1
19: 0 0 211 1219507 IO-APIC 19-fasteoi ens33
24: 0 0 0 0 PCI-MSI-0000:00:15.0 0-edge PCIe PME, pciehp
25: 0 0 0 0 PCI-MSI-0000:00:15.1 0-edge PCIe PME, pciehp
26: 0 0 0 0 PCI-MSI-0000:00:15.2 0-edge PCIe PME, pciehp
27: 0 0 0 0 PCI-MSI-0000:00:15.3 0-edge PCIe PME, pciehp
28: 0 0 0 0 PCI-MSI-0000:00:15.4 0-edge PCIe PME, pciehp
29: 0 0 0 0 PCI-MSI-0000:00:15.5 0-edge PCIe PME, pciehp
30: 0 0 0 0 PCI-MSI-0000:00:15.6 0-edge PCIe PME, pciehp
31: 0 0 0 0 PCI-MSI-0000:00:15.7 0-edge PCIe PME, pciehp
32: 0 0 0 0 PCI-MSI-0000:00:16.0 0-edge PCIe PME, pciehp
33: 0 0 0 0 PCI-MSI-0000:00:16.1 0-edge PCIe PME, pciehp
34: 0 0 0 0 PCI-MSI-0000:00:16.2 0-edge PCIe PME, pciehp
35: 0 0 0 0 PCI-MSI-0000:00:16.3 0-edge PCIe PME, pciehp
36: 0 0 0 0 PCI-MSI-0000:00:16.4 0-edge PCIe PME, pciehp
37: 0 0 0 0 PCI-MSI-0000:00:16.5 0-edge PCIe PME, pciehp
38: 0 0 0 0 PCI-MSI-0000:00:16.6 0-edge PCIe PME, pciehp
39: 0 0 0 0 PCI-MSI-0000:00:16.7 0-edge PCIe PME, pciehp
40: 0 0 0 0 PCI-MSI-0000:00:17.0 0-edge PCIe PME, pciehp
41: 0 0 0 0 PCI-MSI-0000:00:17.1 0-edge PCIe PME, pciehp
42: 0 0 0 0 PCI-MSI-0000:00:17.2 0-edge PCIe PME, pciehp
43: 0 0 0 0 PCI-MSI-0000:00:17.3 0-edge PCIe PME, pciehp
44: 0 0 0 0 PCI-MSI-0000:00:17.4 0-edge PCIe PME, pciehp
45: 0 0 0 0 PCI-MSI-0000:00:17.5 0-edge PCIe PME, pciehp
46: 0 0 0 0 PCI-MSI-0000:00:17.6 0-edge PCIe PME, pciehp
47: 0 0 0 0 PCI-MSI-0000:00:17.7 0-edge PCIe PME, pciehp
48: 0 0 0 0 PCI-MSI-0000:00:18.0 0-edge PCIe PME, pciehp
49: 0 0 0 0 PCI-MSI-0000:00:18.1 0-edge PCIe PME, pciehp
50: 0 0 0 0 PCI-MSI-0000:00:18.2 0-edge PCIe PME, pciehp
51: 0 0 0 0 PCI-MSI-0000:00:18.3 0-edge PCIe PME, pciehp
52: 0 0 0 0 PCI-MSI-0000:00:18.4 0-edge PCIe PME, pciehp
53: 0 0 0 0 PCI-MSI-0000:00:18.5 0-edge PCIe PME, pciehp
54: 0 0 0 0 PCI-MSI-0000:00:18.6 0-edge PCIe PME, pciehp
55: 0 0 0 0 PCI-MSI-0000:00:18.7 0-edge PCIe PME, pciehp
56: 0 107 0 0 PCI-MSIX-0000:03:00.0 0-edge xhci_hcd
61: 0 0 10707 0 PCI-MSI-0000:02:05.0 0-edge ahci[0000:02:05.0]
62: 34 3586 0 0 PCI-MSIX-0000:00:07.7 0-edge vmw_vmci
63: 0 0 0 0 PCI-MSIX-0000:00:07.7 1-edge vmw_vmci
NMI: 0 0 0 0 Non-maskable interrupts
LOC: 2115961 3101570 2475863 2141949 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 0 0 0 0 Performance monitoring interrupts
IWI: 1 0 7 0 IRQ work interrupts
RTR: 0 0 0 0 APIC ICR read retries
RES: 25993 33647 35130 27187 Rescheduling interrupts
CAL: 727367 506153 465956 407392 Function call interrupts
TLB: 7477 7512 8348 7094 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 66 67 67 67 Machine check polls
ERR: 0
MIS: 0
PIN: 0 0 0 0 Posted-interrupt notification event
NPI: 0 0 0 0 Nested posted-interrupt event
PIW: 0 0 0 0 Posted-interrupt wakeup event
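To spot an interrupt source that fires unusually often, the per-CPU counts can be summed per line. This is a sketch under the assumption that the input has the /proc/interrupts layout shown in this section; the function name irq_totals is hypothetical.

```shell
# Hypothetical helper: read /proc/interrupts-style text on stdin and
# print "<irq> <total across CPUs>" for every row after the header.
irq_totals() {
    awk '
        NR > 1 {
            irq = $1
            sub(/:$/, "", irq)
            total = 0
            # per-CPU counters are the leading all-digit fields
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) total += $i
            printf "%s %.0f\n", irq, total
        }
    '
}

# Usage (five busiest interrupt sources):
#   irq_totals < /proc/interrupts | sort -k2,2nr | head -n 5
```

Comparing two snapshots taken some time apart is usually more informative than a single total, since the counters only ever grow after boot.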
1.5 Kernel Logs
Stored under /var/log, these record kernel activity, boot messages, driver loading, hardware detection, and so on.
For example:
├── kern.log
├── kern.log.1
├── kern.log.2.gz
├── kern.log.3.gz
├── kern.log.4.gz
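Since the log is rotated into numbered and compressed files, an export needs to pick up all of them. A minimal sketch, with a hypothetical function name and arguments:

```shell
# Hypothetical helper: copy a log file and its rotated siblings
# (name, name.1, name.2.gz, ...) into a destination directory.
# $1: source directory, $2: base file name, $3: destination directory
copy_rotated_logs() {
    src_dir=$1
    base=$2
    dest_dir=$3
    mkdir -p "$dest_dir" || return 1
    for f in "$src_dir/$base" "$src_dir/$base".*; do
        # skip the unexpanded glob pattern and anything not a regular file
        if [ -f "$f" ]; then
            cp -p "$f" "$dest_dir/" || return 1
        fi
    done
    return 0
}

# Usage (reading /var/log may require root or membership of a log group):
#   copy_rotated_logs /var/log kern.log ./system
```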
1.6 Important Event Records
The last command shows the history of logins, shutdowns, and reboots, including user logins, session durations, and system reboots.
$ last
dog tty2 tty2 Tue Aug 27 22:05 - down (01:04)
reboot system boot 6.8.0-40-generic Tue Aug 27 22:01 - 23:09 (01:07)
dog tty2 tty2 Tue Aug 27 21:07 - down (00:04)
reboot system boot 6.8.0-40-generic Tue Aug 27 21:07 - 21:12 (00:05)
dog tty2 tty2 Tue Aug 27 21:01 - down (00:06)
reboot system boot 6.8.0-40-generic Tue Aug 27 21:00 - 21:07 (00:06)
dog tty2 tty2 Tue Aug 27 17:32 - down (00:01)
reboot system boot 6.8.0-40-generic Tue Aug 27 17:32 - 17:34 (00:01)
dog tty2 tty2 Tue Aug 27 15:22 - down (02:10)
reboot system boot 6.8.0-40-generic Tue Aug 27 15:21 - 17:32 (02:10)
dog tty2 tty2 Tue Aug 27 13:58 - down (01:23)
reboot system boot 6.8.0-40-generic Tue Aug 27 13:57 - 15:21 (01:24)
dog tty2 tty2 Tue Aug 27 13:06 - down (00:51)
reboot system boot 6.8.0-40-generic Tue Aug 27 13:05 - 13:57 (00:52)

wtmp begins Tue Aug 27 13:05:16 2024
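A machine that reboots many times in one day, as in the sample above, is often worth a closer look. The sketch below counts reboot entries per day; the function name is hypothetical and it assumes the default last output layout shown in this section.

```shell
# Hypothetical helper: read `last` output on stdin and print
# "<month> <day> <number of reboot entries>" for each day.
# Assumption: default `last` layout, where fields 6 and 7 of a
# reboot line are the month and the day.
count_reboots_per_day() {
    awk '
        $1 == "reboot" { count[$6 " " $7]++ }
        END { for (day in count) print day, count[day] }
    '
}

# Usage:
#   last | count_reboots_per_day
```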
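1.7 Querying Information About a Specific Running Process
The ps (Process Status) command displays information about the processes currently running on the system. It provides details about each process, including the process ID (PID), user, CPU usage, memory usage, start time, and command.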
$ sudo ps -aux | grep MvLogServer*
root 957 1.2 0.0 15516 3080 ? Ssl 15:18 4:43 /opt/MVS/logserver/MvLogServer
sog 102757 0.0 0.0 12624 2560 pts/1 S+ 21:36 0:00 grep --color=auto MvLogServer*
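As the output above shows, piping ps into grep also matches the grep process itself, and an unquoted pattern such as MvLogServer* can be expanded by the shell if a matching file exists in the current directory. One way to avoid both is to compare the executable name exactly. This is a sketch with a hypothetical function name; it assumes the ps aux column layout, where the command starts in field 11.

```shell
# Hypothetical helper: read `ps aux` output on stdin and keep the
# header plus every process whose executable base name equals $1.
procs_named() {
    awk -v name="$1" '
        NR == 1 { print; next }          # keep the column header
        {
            n = split($11, parts, "/")   # field 11 is the executable path
            if (parts[n] == name) print
        }
    '
}

# Usage:
#   ps aux | procs_named MvLogServer
```

Where procps is available, pgrep -a MvLogServer gives a similar result without parsing ps output.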
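1.8 System Logs
Contain the various events and messages produced while the system runs, including kernel logs, application logs, service status, and system errors.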
├── syslog
├── syslog.1
├── syslog.2.gz
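Older rotations are gzip-compressed, so a plain grep misses them. The sketch below searches plain and compressed files in one pass; the function name search_logs is hypothetical, and it assumes GNU gzip, where -dcf passes uncompressed input through unchanged.

```shell
# Hypothetical helper: search plain and gzip-compressed log files.
# $1: extended regular expression; remaining arguments: log files.
# Assumption: GNU gzip, whose -f option together with -c copies
# non-gzip input to stdout unchanged.
search_logs() {
    pattern=$1
    shift
    gzip -dcf -- "$@" | grep -E -- "$pattern"
}

# Usage:
#   search_logs 'boss_dog' /var/log/syslog /var/log/syslog.1 /var/log/syslog.2.gz
```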
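1.9 Graphics Display Status Monitoring
Check these files to diagnose problems with the graphical interface, such as incorrect display output or a graphics driver that failed to load.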
├── Xorg.0.log
└── Xorg.0.log.old
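Xorg marks each log line with a type tag, where (EE) is an error and (WW) is a warning, so the interesting lines can be pulled out quickly. A minimal sketch with a hypothetical function name:

```shell
# Hypothetical helper: keep only the error (EE) and warning (WW)
# lines of an Xorg log read from stdin.
xorg_problems() {
    grep -E '\((EE|WW)\)'
}

# Usage (the log is commonly found under /var/log):
#   xorg_problems < /var/log/Xorg.0.log
```

The legend near the top of the log also contains these tags, so it will appear in the output as well.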
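2. Examples
- Check the CPU frequency:
grep 'cpu MHz' /proc/cpuinfo
- Check the CPU model:
grep 'model name' /proc/cpuinfo
- Check the CPU temperature:
cat /sys/class/hwmon/hwmon1/temp2_input
(CPU0 is temp2 and CPU3 is temp5, with the cores in between following the same pattern)
- Check memory information:
grep 'MemTotal' /proc/meminfo
(The main fields used are MemTotal, MemFree, Buffers, Cached, Active, Shmem, and MemAvailable, which roughly correspond to the total, free, buff/cache, used, shared, and available columns of free)
- Check whether the CPUs are in performance mode:
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor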
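The raw values from these files usually need a little post-processing. hwmon temp*_input files report millidegrees Celsius, and /proc/meminfo reports sizes in kB. The two helpers below are a sketch; their names are hypothetical.

```shell
# Hypothetical helper: convert hwmon readings (millidegrees Celsius,
# one per line on stdin) to degrees Celsius.
millideg_to_celsius() {
    awk '{ printf "%.1f\n", $1 / 1000 }'
}

# Hypothetical helper: read /proc/meminfo-style text on stdin and
# print MemAvailable as a percentage of MemTotal.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { if (total > 0) printf "%.1f\n", avail * 100 / total }
    '
}

# Usage:
#   millideg_to_celsius < /sys/class/hwmon/hwmon1/temp2_input
#   mem_available_pct < /proc/meminfo
```

The hwmon index (hwmon1 here) can differ between machines and boots, so it is worth checking the name file in each hwmon directory first.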
Assuming the software is named boss_dog, the exported report directory looks like this:
report_20250726_15_27_45
├── config
│   ├── boss_dog.conf
├── db
│   ├── boss_dog.db
├── log
│   ├── boss_dog.log
│   ├── boss_dog.1.log
│   ├── boss_dog.2.log
├── summary.log
└── system
    ├── boss_dog.default
    ├── boss_dog.service.log
    ├── recordSystem
    │   └── temperature_log.csv
    ├── df
    ├── dmesg
    ├── interrupts
    ├── kern.log
    ├── kern.log.1
    ├── kern.log.2.gz
    ├── kern.log.3.gz
    ├── kern.log.4.gz
    ├── last
    ├── ps
    ├── service.status
    ├── syslog
    ├── syslog.1
    ├── syslog.2.gz
    ├── Xorg.0.log
    └── Xorg.0.log.old
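A directory like this could be assembled with a small script. The following is only a sketch of the skeleton and the command snapshots: the function names are hypothetical, and how the author's tool actually gathers the files (including service.status and the boss_dog files) is not shown in the original.

```shell
# Hypothetical helper: create the report skeleton and print its path.
# $1: parent directory, $2: optional timestamp (defaults to now)
create_report_dir() {
    stamp=${2:-$(date '+%Y%m%d_%H_%M_%S')}
    report="$1/report_$stamp"
    mkdir -p "$report/config" "$report/db" "$report/log" \
             "$report/system/recordSystem" || return 1
    printf '%s\n' "$report"
}

# Hypothetical helper: save command snapshots into the system/ directory.
# $1: path of the report's system/ directory
# Failures (for example dmesg without sufficient privileges) are
# written into the corresponding file instead of aborting the export.
collect_system_snapshots() {
    df > "$1/df" 2>&1
    dmesg > "$1/dmesg" 2>&1
    cat /proc/interrupts > "$1/interrupts" 2>&1
    last > "$1/last" 2>&1
    ps aux > "$1/ps" 2>&1
    return 0
}

# Usage:
#   report=$(create_report_dir /tmp) && collect_system_snapshots "$report/system"
```

Capturing stderr into each file keeps a record of why a snapshot is missing, which is useful when the report is read on another machine.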