SSDs performance on head/freebsd-10 stable using FIO

Kashyap Desai kashyap.desai at avagotech.com
Mon Jul 14 08:36:26 UTC 2014


> -----Original Message-----
> From: Alexander Motin [mailto:mavbsd at gmail.com] On Behalf Of Alexander
> Motin
> Sent: Friday, July 11, 2014 4:45 AM
> To: Kashyap Desai
> Cc: FreeBSD-scsi
> Subject: Re: SSDs performance on head/freebsd-10 stable using FIO
>
> On 10.07.2014 16:28, Kashyap Desai wrote:
> > From: Alexander Motin [mailto:mavbsd at gmail.com] On Behalf Of
> Alexander
> >> On 10.07.2014 15:00, Kashyap Desai wrote:
> >>> I have 8 SSDs in my setup, all behind LSI’s 12Gb/s MegaRAID
> >>> controller as JBOD. I also found that FIO can be used in async
> >>> mode after loading the “aio” kernel module.
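
As a rough sketch of the setup described above, the helper below prints a
fio command line with one random-read job per disk. The function name
fio_cmd, the block size, queue depth, runtime and the daN device names are
all assumptions for illustration; they are not taken from the original test.

```shell
#!/bin/sh
# Print (do not run) a fio command line with one random-read job per disk.
# Block size, queue depth and runtime are illustrative assumptions.
# Usage: fio_cmd <disk>...
fio_cmd() {
    # Options under --name=global apply to every job that follows.
    cmd="fio --name=global --ioengine=posixaio --rw=randread --bs=4k"
    cmd="$cmd --iodepth=32 --runtime=30 --time_based --group_reporting"
    for disk in "$@"; do
        # One job per disk, reading the raw device.
        cmd="$cmd --name=$disk --filename=/dev/$disk"
    done
    echo "$cmd"
}
```

After `kldload aio`, something like `fio_cmd da0 da1 | sh` would run the
printed command; printing first makes it easy to inspect before running.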
> >>>
> >>> Using a single SSD, I am able to see 110K-130K IOPS. These IOPS
> >>> counts match what I see on a Linux machine.
> >>>
> >>> Now, I am not able to scale IOPS on my machine beyond 200K. I see
> >>> the CPU is almost fully occupied, with no idle time, once IOPS
> >>> reach 200K.
> >>>
> >>> If you have any pointers to try, I can do some experiments on my
> >>> setup.
> >>
> >> Getting such results, I would immediately start profiling with
> >> pmcstat. Quite likely you are hitting some new lock congestion. Start
> >> with a simple `pmcstat -n 100000000 -TS unhalted-cycles`. It is hard
> >> to say for sure what went wrong there without more data, so just a
> >> couple of
> > I have attached the profile output for the command mentioned above. I
> > will dig further and see whether this is the theoretical limit for a
> > CAM-attached HBA.
>
> The first thing I noticed in this profile output is a bunch of TLB
> shootdowns. You cannot reach reasonable performance from user level
> without the HBA driver supporting unmapped I/O. Both the mps and mpr
> drivers support it, but for some reason mrsas still does not. Even at
> non-peak I/O rates on a multi-core system, TLB shootdowns in such a case
> can eat an additional 30% of CPU time.

Thanks! I can try this part in mrsas. Can you help me understand
what you mean by unmapped I/O?

>
> Another thing I see is the congestion I mentioned on the driver's CAM
> SIM lock. You need either multiple cards or multiqueue.
>
> >> thoughts:
> >>
> >> First of all, I've never tried aio in my benchmarks, only synchronous
> >> ones. Try to run 8 instances of `dd if=/dev/daX of=/dev/null bs=512`
> >> per SSD at the same time, just as I did. You may vary the number of
> >> dd's, but keep the total below 256, or you will need to increase the
> >> nswbuf limit in kern_vfs_bio_buffer_alloc().
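
The dd test suggested above could be scripted roughly as follows. This is
only a sketch: the function name gen_dd_cmds and the daN device names are
assumptions, and the function prints the commands instead of running them
so they can be checked first.

```shell
#!/bin/sh
# Print (do not run) one background dd reader per (disk, instance) pair.
# Usage: gen_dd_cmds <instances-per-disk> <disk>...
gen_dd_cmds() {
    per_disk=$1
    shift
    total=$((per_disk * $#))
    # Keep the total number of dd readers below 256, as advised above.
    if [ "$total" -ge 256 ]; then
        echo "refusing: $total dd instances (keep total below 256)" >&2
        return 1
    fi
    for disk in "$@"; do
        i=0
        while [ "$i" -lt "$per_disk" ]; do
            echo "dd if=/dev/$disk of=/dev/null bs=512 &"
            i=$((i + 1))
        done
    done
    # Block until every background reader has finished.
    echo "wait"
}
```

For example, `gen_dd_cmds 8 da0 da1 da2 da3 da4 da5 da6 da7 | sh` would
start 64 readers, 8 per SSD, and wait for them to finish.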
> >
> > I also ran multiple dd instances and see IOPS throttle somewhere
> > around 200K.
> >
> > Do we have any mechanism to check the CAM layer's maximum IOPS without
> > involving an actual device? Something like a _null_ device driver which
> > just sends the command back to the CAM layer?
>
> There is no such thing now. Such a test would radically change the
> timings of operation, and I am not sure how useful the results would be.
>
> >> Second, you are using a single HBA, which should create significant
> >> congestion around its CAM SIM lock. The proper solution would be to
> >> add multiple-queue support to the driver, and we have discussed it
> >> with Scott Long for quite some time, but that requires more work (I
> >> hope you may be interested in it ;) ). Or you may just insert 3-4
> >> HBAs. I reached my million IOPS with four 2008/2308 6Gbps HBAs and 16
> >> SATA SSDs.
> >
> > I remember this part, and it would be really good to contribute to this
> > work. As part of this we have started a multiple MSI-X implementation
> > in <mrsas>, which will have multiple reply queues per MSI-X.
>
> Cool!
>
> > Do we really need multiple submission queues in the low-level driver?
> > I thought there would be a CAM interface for multiqueue that _all_
> > low-level drivers need to hook into.
>
> CAM is still oriented around a single submission queue for now, but it
> allows a driver to have multiple completion queues. So I would start by
> implementing the latter, each bound to its own MSI-X interrupt and calling
> completion without using the SIM lock or holding any other locks during
> the upcall. CAM provides a way to avoid an extra context switch in that
> case, which could be very useful.
>
> --
> Alexander Motin

