[Bug 236989] AWS EC2 lockups "Missing interrupt"
bugzilla-noreply at freebsd.org
Thu May 7 17:31:01 UTC 2020
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=236989
--- Comment #24 from Charles O'Donnell <cao at bus.net> ---
New development; see the three notes below.
N.B. The system appears to have fully recovered. Normally I would have expected
a freeze.
1. CPU alarm from a custom AWS monitor at 16:43 UTC (12:43 PM ET):
Alarm Details:
- Name: Starch CPU
- Description:
- State Change: OK -> ALARM
- Reason for State Change: Threshold Crossed: 1 datapoint [31.4 (07/05/20
16:38:00)] was greater than or equal to the threshold (30.0).
- Timestamp: Thursday 07 May, 2020 16:43:35 UTC
- AWS Account: 539612714288
- Alarm Arn:
arn:aws:cloudwatch:us-east-1:539612714288:alarm:Starch CPU
Threshold:
- The alarm is in the ALARM state when the metric is
GreaterThanOrEqualToThreshold 30.0 for 300 seconds.
2. Sudden jump in failed 9k mbuf allocations (1693 -> 2577) between 12:00 and 13:00 ET:
===> Thu May 7 10:00:00 EDT 2020
mbuf_jumbo_page: 4096, 490945, 0, 56,45464111, 0, 0
mbuf_jumbo_9k: 9216, 145465, 7538, 450,66361278,1640, 0
mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
===> Thu May 7 11:00:00 EDT 2020
mbuf_jumbo_page: 4096, 490945, 16, 113,45658689, 0, 0
mbuf_jumbo_9k: 9216, 145465, 7592, 397,66645310,1642, 0
mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
===> Thu May 7 12:00:00 EDT 2020
mbuf_jumbo_page: 4096, 490945, 182, 31,45730287, 0, 0
mbuf_jumbo_9k: 9216, 145465, 7461, 259,66753693,1693, 0
mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
===> Thu May 7 13:00:00 EDT 2020
mbuf_jumbo_page: 4096, 490945, 119, 109,46249719, 0, 0
mbuf_jumbo_9k: 9216, 145465, 7863, 207,67594999,2577, 0
mbuf_jumbo_16k: 16384, 81824, 0, 0, 0, 0, 0
dev.ena.0.queue7.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue6.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue5.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue4.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue3.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue2.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue1.rx_ring.mjum_alloc_fail: 0
dev.ena.0.queue0.rx_ring.mjum_alloc_fail: 0
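To put a number on the jump, here is a rough sketch that computes the hour-over-hour change in the mbuf_jumbo_9k failure counter from the four snapshots above. It assumes the values follow the usual `vmstat -z` column order (SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP); the function names are made up for illustration.

```python
# Sketch: per-hour change in the failure counter for mbuf_jumbo_9k.
# Assumption: columns are SIZE, LIMIT, USED, FREE, REQ, FAIL, SLEEP
# (the usual `vmstat -z` order). Function names are hypothetical.
SNAPSHOTS = {
    "10:00": "mbuf_jumbo_9k: 9216, 145465, 7538, 450,66361278,1640, 0",
    "11:00": "mbuf_jumbo_9k: 9216, 145465, 7592, 397,66645310,1642, 0",
    "12:00": "mbuf_jumbo_9k: 9216, 145465, 7461, 259,66753693,1693, 0",
    "13:00": "mbuf_jumbo_9k: 9216, 145465, 7863, 207,67594999,2577, 0",
}

FAIL_COLUMN = 5  # zero-based index of FAIL among the numeric fields


def fail_count(line):
    """Return the FAIL value from one 'name: v1, v2, ...' line."""
    _, values = line.split(":", 1)
    fields = [int(v) for v in values.split(",")]
    return fields[FAIL_COLUMN]


def fail_deltas(snapshots):
    """Map 'start-end' to the increase in FAIL between consecutive snapshots."""
    times = list(snapshots)
    counts = [fail_count(snapshots[t]) for t in times]
    return {
        f"{a}-{b}": y - x
        for a, b, x, y in zip(times, times[1:], counts, counts[1:])
    }


if __name__ == "__main__":
    for interval, delta in fail_deltas(SNAPSHOTS).items():
        print(f"{interval} EDT: +{delta} failed 9k allocations")
```

On these snapshots the increases are 2, 51 and 884, so nearly all of the failures fall in the 12:00 to 13:00 window that contains the ena0 resets.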
3. ena0 reset at 12:43 ET:
May 7 12:43:19 s4 kernel: ena0: The number of lost tx completion is above the
threshold (129 > 128). Reset the device
May 7 12:43:19 s4 kernel: ena0: Trigger reset is on
May 7 12:43:19 s4 kernel: ena0: device is going DOWN
May 7 12:43:22 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x319ena0:
free uncompleted tx mbuf qid 7 idx 0x173
May 7 12:43:23 s4 kernel: ena0: ena0: device is going UP
May 7 12:43:23 s4 kernel: link is UP
May 7 12:45:00 s4 kernel: ena0: The number of lost tx completion is above the
threshold (129 > 128). Reset the device
May 7 12:45:00 s4 kernel: ena0: Trigger reset is on
May 7 12:45:00 s4 kernel: ena0: device is going DOWN
May 7 12:45:04 s4 kernel: ena0: free uncompleted tx mbuf qid 3 idx 0x102
May 7 12:45:04 s4 kernel: ena0: ena0: device is going UP
May 7 12:45:04 s4 kernel: link is UP
May 7 12:45:26 s4 kernel: ena0: The number of lost tx completion is above the
threshold (129 > 128). Reset the device
May 7 12:45:26 s4 kernel: ena0: Trigger reset is on
May 7 12:45:26 s4 kernel: ena0: device is going DOWN
May 7 12:45:29 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x3c7ena0:
free uncompleted tx mbuf qid 2 idx 0x2c5ena0: free uncompleted tx mbuf qid 6
idx 0x2abena0: free uncompleted tx mbuf qid 7 idx 0x241
May 7 12:45:30 s4 kernel:
May 7 12:45:30 s4 kernel: stray irq265
May 7 12:45:30 s4 kernel: ena0: ena0: device is going UP
May 7 12:45:30 s4 kernel: link is UP
May 7 12:46:05 s4 kernel: ena0: Keep alive watchdog timeout.
May 7 12:46:05 s4 kernel: ena0: Trigger reset is on
May 7 12:46:05 s4 kernel: ena0: device is going DOWN
May 7 12:46:07 s4 kernel: ena0: free uncompleted tx mbuf qid 1 idx 0x123ena0:
free uncompleted tx mbuf qid 3 idx 0xeeena0: free uncompleted tx mbuf qid 6 idx
0x208
May 7 12:46:08 s4 kernel: ena0: ena0: device is going UP
May 7 12:46:08 s4 kernel: link is UP
May 7 12:46:36 s4 kernel: ena0: The number of lost tx completion is above the
threshold (129 > 128). Reset the device
May 7 12:46:36 s4 kernel: ena0: Trigger reset is on
May 7 12:46:36 s4 kernel: ena0: device is going DOWN
May 7 12:46:37 s4 kernel: ena0: free uncompleted tx mbuf qid 0 idx 0x2c2ena0:
free uncompleted tx mbuf qid 1 idx 0x135ena0: free uncompleted tx mbuf qid 2
idx 0xeeena0: free uncompleted tx mbuf qid 3 idx 0x373ena0: free uncompleted tx
mbuf qid 4 idx 0x88ena0: free uncompleted t>
May 7 12:46:38 s4 kernel: ena0: ena0: device is going UP
May 7 12:46:38 s4 kernel: link is UP
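For a quick summary of the log above, a small sketch that tallies the resets by stated cause. The log excerpt is copied from the messages above (only the cause and trigger lines); the cause labels and function name are my own and not anything the ena driver defines.

```python
# Sketch: tally ena0 resets by the cause printed just before each reset.
# The cause labels below are hypothetical names chosen for this summary.
import re
from collections import Counter

LOG = """\
May 7 12:43:19 s4 kernel: ena0: The number of lost tx completion is above the
threshold (129 > 128). Reset the device
May 7 12:43:19 s4 kernel: ena0: Trigger reset is on
May 7 12:45:00 s4 kernel: ena0: The number of lost tx completion is above the
threshold (129 > 128). Reset the device
May 7 12:45:00 s4 kernel: ena0: Trigger reset is on
May 7 12:45:26 s4 kernel: ena0: The number of lost tx completion is above the
threshold (129 > 128). Reset the device
May 7 12:45:26 s4 kernel: ena0: Trigger reset is on
May 7 12:46:05 s4 kernel: ena0: Keep alive watchdog timeout.
May 7 12:46:05 s4 kernel: ena0: Trigger reset is on
May 7 12:46:36 s4 kernel: ena0: The number of lost tx completion is above the
threshold (129 > 128). Reset the device
May 7 12:46:36 s4 kernel: ena0: Trigger reset is on
"""

CAUSES = {
    "lost_tx_completion": re.compile(r"number of lost tx completion is above"),
    "keep_alive_timeout": re.compile(r"Keep alive watchdog timeout"),
}


def summarize_resets(log):
    """Return (total resets triggered, Counter of causes seen in the log)."""
    total = len(re.findall(r"Trigger reset is on", log))
    causes = Counter(
        {name: len(pattern.findall(log)) for name, pattern in CAUSES.items()}
    )
    return total, causes


if __name__ == "__main__":
    total, causes = summarize_resets(LOG)
    print(f"resets triggered: {total}")
    for name, count in causes.items():
        print(f"  {name}: {count}")
```

That gives five resets between 12:43:19 and 12:46:36: four for lost tx completions and one for the keep alive watchdog.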