[EXTERNAL] Re: FreeBSD10 Stable + ZFS + PostgreSQL + SSD performance drop < 24 hours

Caza, Aaron Aaron.Caza at ca.weatherford.com
Mon Jun 12 04:50:49 UTC 2017


Thanks, Jov, for your suggestions.  Per your e-mail, I added “explain analyze” to the script:

#!/bin/sh
psql --username=test --password=supersecret -h /db -d test << EOL
\timing on
explain analyze select count(*) from test;
\q
EOL

Sample run of above script before degradation:
Timing is on.
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=3350822.35..3350822.36 rows=1 width=0) (actual time=60234.556..60234.556 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..3296901.08 rows=21568508 width=0) (actual time=1.126..57021.470 rows=21568508 loops=1)
Planning time: 4.968 ms
Execution time: 60234.649 ms
(4 rows)

Time: 60248.503 ms
test$ uptime
10:33PM  up 7 mins, 3 users, load averages: 1.68, 1.79, 0.94


Sample run of above script after degradation (~11.33 hours uptime):
Timing is on.
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=3350822.35..3350822.36 rows=1 width=0) (actual time=485669.361..485669.361 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..3296901.08 rows=21568508 width=0) (actual time=0.008..483241.253 rows=21568508 loops=1)
Planning time: 0.529 ms
Execution time: 485669.411 ms
(4 rows)

Time: 485670.432 ms
test$ uptime
9:59PM  up 11:21, 2 users, load averages: 1.11, 2.13, 2.14


Regarding dd’ing the pgdata directory, that didn’t work for me as Postgres splits the database up into multiple 2GB files; dd’ing a single 2GB file on a system with 8GB of RAM doesn’t seem representative.  I opted to create a 16GB file (dd if=/dev/random of=/testdb/test bs=1m count=16000) on the pertinent ZFS file system, then performed a dd read (dd if=/testdb/test of=/dev/null bs=1m) on that:
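
To watch the decay over time, a loop along these lines (an untested sketch; it re-reads the 16GB test file created above, and the hourly interval is arbitrary) can log read throughput against uptime:

#!/bin/sh
# Re-read the test file once an hour and record throughput;
# dd prints its transfer summary on stderr, hence the redirect.
while :; do
    uptime
    dd if=/testdb/test of=/dev/null bs=1m 2>&1 | tail -1
    sleep 3600
done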

Sample run after degradation (~11.66 hours uptime):
16000+0 records in
16000+0 records out
16777216000 bytes transferred in 274.841792 secs (61043176 bytes/sec)
test$ uptime
10:25PM  up 11:46, 2 users, load averages: 1.00, 1.28, 1.59


After rebooting, we can see *MUCH* better performance:
test$ dd if=/testdb/test of=/dev/null bs=1m
16000+0 records in
16000+0 records out
16777216000 bytes transferred in 19.456043 secs (862313883 bytes/sec)
test$ dd if=/testdb/test of=/dev/null bs=1m
16000+0 records in
16000+0 records out
16777216000 bytes transferred in 19.375321 secs (865906473 bytes/sec)
test$ dd if=/testdb/test of=/dev/null bs=1m
16000+0 records in
16000+0 records out
16777216000 bytes transferred in 19.173458 secs (875022968 bytes/sec)
test$ uptime
10:30PM  up 4 mins, 3 users, load averages: 3.52, 1.62, 0.69

These tests were conducted with the previously mentioned Samsung 850 Pro 256GB SSDs (Intel Xeon E31240 with 8GB of RAM).  There’s essentially nothing else running on this system (99.5-100% idle) and no other disk activity.

Regards,
A

From: Jov [mailto:amutu at amutu.com]
Sent: Sunday, June 11, 2017 5:50 PM
To: Caza, Aaron
Cc: freebsd-hackers at freebsd.org; Allan Jude
Subject: [EXTERNAL] Re: FreeBSD10 Stable + ZFS + PostgreSQL + SSD performance drop < 24 hours

To exclude a filesystem problem, I would do a dd test on the pgdata dataset after the performance drop; if the read and/or write utilization can reach 100%, or the performance is as expected, then I would say the problem is not the fs or the OS.

For pg, what's your output of explain analyze before and after the performance drop?

On Jun 12, 2017, 12:51 AM, "Caza, Aaron" <Aaron.Caza at ca.weatherford.com> wrote:
Thanks, Allan, for the suggestions.  I tried gstat -d, but deletes (d/s) don't seem to be it, as the column stays at 0 despite vfs.zfs.trim.enabled=1.

This is most likely due to the "layering" I use as, for historical reasons, I have GEOM ELI set up to essentially emulate 4k sectors regardless of the underlying media.  I do my own alignment and partition sizing, and I have the ZFS record size set to 8k for Postgres.
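
For reference, the layering is roughly the following (a sketch only; device names, labels, and pool/dataset names are placeholders, not my actual setup):

gpart create -s gpt ada0
gpart add -t freebsd-zfs -a 4k -l disk0 ada0
geli init -s 4096 /dev/gpt/disk0          # -s 4096 emulates 4k sectors
geli attach /dev/gpt/disk0
# ...same for the second disk, then mirror the .eli providers:
zpool create tank mirror gpt/disk0.eli gpt/disk1.eli
zfs create -o recordsize=8k tank/pgdata   # 8k to match Postgres pages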

In gstat, the SSDs %busy is 90-100% on startup after reboot.  Once the performance degradation hits (<24 hours later), I'm seeing %busy at ~10%.
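
A quick way to watch this, shown here as an illustrative command rather than my exact invocation (the 1-second interval is arbitrary):

gstat -d -I 1s    # -d adds the delete (BIO_DELETE/TRIM) columns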

#!/bin/sh
psql --username=test --password=supersecret -h /db -d test << EOL
\timing on
select count(*) from test;
\q
EOL

Sample run of above script after reboot (before degradation hits) (Samsung 850 Pros in ZFS mirror):
Timing is on.
  count
----------
 21568508
(1 row)

Time: 57029.262 ms

Sample run of above script after degradation (Samsung 850 Pros in ZFS mirror):
Timing is on.
  count
----------
 21568508
(1 row)

Time: 583595.239 ms
(Uptime ~1 day in this particular case.)


Any other suggestions?

Regards,
A

-----Original Message-----
From: owner-freebsd-hackers at freebsd.org [mailto:owner-freebsd-hackers at freebsd.org] On Behalf Of Allan Jude
Sent: Saturday, June 10, 2017 9:40 PM
To: freebsd-hackers at freebsd.org
Subject: [EXTERNAL] Re: FreeBSD10 Stable + ZFS + PostgreSQL + SSD performance drop < 24 hours

On 06/10/2017 12:36, Slawa Olhovchenkov wrote:
> On Sat, Jun 10, 2017 at 04:25:59PM +0000, Caza, Aaron wrote:
>
>> Gents,
>>
>> I'm experiencing an issue where iterating over a PostgreSQL table of ~21.5 million rows (select count(*)) goes from ~35 seconds to ~635 seconds on Intel 540 SSDs.  This is using a FreeBSD 10 amd64 stable kernel back from Jan 2017.  SSDs are basically 2 drives in a ZFS mirrored zpool.  I'm using PostgreSQL 9.5.7.
>>
>> I've tried:
>>
>> *       Using the FreeBSD10 amd64 stable kernel snapshot of May 25, 2017.
>>
>> *       Tested on half a dozen machines with different models of SSDs:
>>
>> o   Intel 510s (120GB) in ZFS mirrored pair
>>
>> o   Intel 520s (120GB) in ZFS mirrored pair
>>
>> o   Intel 540s (120GB) in ZFS mirrored pair
>>
>> o   Samsung 850 Pros (256GB) in ZFS mirrored pair
>>
>> *       Using bonnie++ to remove Postgres from the equation; performance does indeed drop.
>>
>> *       Rebooting the server and immediately re-running the test; performance is back to original.
>>
>> *       Tried using Karl Denninger's patch from PR187594 (it took some work to find a kernel that the FreeBSD10 patch would both apply to and compile cleanly against).
>>
>> *       Tried disabling ZFS lz4 compression.
>>
>> *       Ran the same test on a FreeBSD9.0 amd64 system using PostgreSQL 9.1.3 with 2 Intel 520s in a ZFS mirrored pair.  The system had 165 days of uptime and the test took ~80 seconds; after rebooting and re-running, it was still ~80 seconds (older processor and memory in this system).
>>
>> I realize that there's a whole lot of info I'm not including (dmesg, zfs-stats -a, gstat, et cetera): I'm hoping some enlightened individual will be able to point me to a solution with only the above to go on.
>
> Just a random guess: can you try r307264 (I mean a regression in
> r307266)?

This sounds a bit like an issue I investigated for a customer a few months ago.

Look at gstat -d (it includes DELETE operations like TRIM).

If you see a lot of that happening, try setting vfs.zfs.trim.enabled=0 in /boot/loader.conf and see if your issues go away.
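
Concretely, something like this (a sketch; vfs.zfs.trim.enabled is a boot-time tunable, so the loader.conf change takes effect at the next reboot):

# check the current value at runtime:
sysctl vfs.zfs.trim.enabled
# disable TRIM by adding this line to /boot/loader.conf, then reboot:
vfs.zfs.trim.enabled=0

The kstat.zfs.misc.zio_trim sysctls, where available, also show how many TRIMs are actually being issued.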

The FreeBSD TRIM code for ZFS basically waits until a sector has been free for a while (to avoid doing a TRIM on a block we'll immediately reuse), so your benchmark will run fine for a little while, then suddenly the TRIM will kick in.

For Postgres, fio, bonnie++, etc., make sure the ZFS dataset you are storing the data on / benchmarking has a recordsize that matches the workload.
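
For example (a sketch; "tank/pgdata" is a placeholder dataset name, not a real path from this thread):

# Postgres heap pages are 8k by default, so:
zfs get recordsize tank/pgdata
zfs set recordsize=8k tank/pgdata   # only newly written blocks are affected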

If you are doing a write-only benchmark and you see lots of reads in gstat, you know you are having to do read-modify-writes, and that is why your performance is so bad.


--
Allan Jude