summaryrefslogtreecommitdiff
path: root/block/nvme.c
AgeCommit message (Collapse)Author
2021-12-09block/nvme: fix infinite loop in nvme_free_req_queue_cb()Stefan Hajnoczi
When the request free list is exhausted the coroutine waits on q->free_req_queue for the next free request. Whenever a request is completed a BH is scheduled to invoke nvme_free_req_queue_cb() and wake up waiting coroutines. 1. nvme_get_free_req() waits for a free request: while (q->free_req_head == -1) { ... trace_nvme_free_req_queue_wait(q->s, q->index); qemu_co_queue_wait(&q->free_req_queue, &q->lock); ... } 2. nvme_free_req_queue_cb() wakes up the coroutine: while (qemu_co_enter_next(&q->free_req_queue, &q->lock)) { ^--- infinite loop when free_req_head == -1 } nvme_free_req_queue_cb() and the coroutine form an infinite loop when q->free_req_head == -1. Fix this by checking q->free_req_head in nvme_free_req_queue_cb(). If the free request list is exhausted, don't wake waiting coroutines. Eventually an in-flight request will complete and the BH will be scheduled again, guaranteeing forward progress. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20211208152246.244585-1-stefanha@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2021-11-02block/nvme: Extract nvme_free_queue() from nvme_free_queue_pair()Philippe Mathieu-Daudé
Instead of duplicating code, extract the common helper to free a single queue. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20211006164931.172349-4-philmd@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02block/nvme: Display CQ/SQ pointer in nvme_free_queue_pair()Philippe Mathieu-Daudé
For debugging purpose it is helpful to know the CQ/SQ pointers. We already have a trace event in nvme_free_queue_pair(), extend it to report these pointer addresses. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20211006164931.172349-3-philmd@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-11-02block/nvme: Automatically free qemu_memalign() with QEMU_AUTO_VFREEPhilippe Mathieu-Daudé
Since commit 4d324c0bf65 ("introduce QEMU_AUTO_VFREE") buffers allocated by qemu_memalign() can automatically freed when using the QEMU_AUTO_VFREE macro. Use it to simplify a bit. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20211006164931.172349-2-philmd@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-09-29block: use int64_t instead of int in driver discard handlersVladimir Sementsov-Ogievskiy
We are generally moving to int64_t for both offset and bytes parameters on all io paths. Main motivation is realization of 64-bit write_zeroes operation for fast zeroing large disk chunks, up to the whole disk. We chose signed type, to be consistent with off_t (which is signed) and with possibility for signed return type (where negative value means error). So, convert driver discard handlers bytes parameter to int64_t. The only caller of all updated function is bdrv_co_pdiscard in block/io.c. It is already prepared to work with 64bit requests, but pass at most max(bs->bl.max_pdiscard, INT_MAX) to the driver. Let's look at all updated functions: blkdebug: all calculations are still OK, thanks to bdrv_check_qiov_request(). both rule_check and bdrv_co_pdiscard are 64bit blklogwrites: pass to blk_loc_writes_co_log which is 64bit blkreplay, copy-on-read, filter-compress: pass to bdrv_co_pdiscard, OK copy-before-write: pass to bdrv_co_pdiscard which is 64bit and to cbw_do_copy_before_write which is 64bit file-posix: one handler calls raw_account_discard() is 64bit and both handlers calls raw_do_pdiscard(). Update raw_do_pdiscard, which pass to RawPosixAIOData::aio_nbytes, which is 64bit (and calls raw_account_discard()) gluster: somehow, third argument of glfs_discard_async is size_t. Let's set max_pdiscard accordingly. iscsi: iscsi_allocmap_set_invalid is 64bit, !is_byte_request_lun_aligned is 64bit. list.num is uint32_t. Let's clarify max_pdiscard and pdiscard_alignment. mirror_top: pass to bdrv_mirror_top_do_write() which is 64bit nbd: protocol limitation. max_pdiscard is alredy set strict enough, keep it as is for now. nvme: buf.nlb is uint32_t and we do shift. So, add corresponding limits to nvme_refresh_limits(). preallocate: pass to bdrv_co_pdiscard() which is 64bit. rbd: pass to qemu_rbd_start_co() which is 64bit. qcow2: calculations are still OK, thanks to bdrv_check_qiov_request(), qcow2_cluster_discard() is 64bit. raw-format: raw_adjust_offset() is 64bit, bdrv_co_pdiscard too. throttle: pass to bdrv_co_pdiscard() which is 64bit and to throttle_group_co_io_limits_intercept() which is 64bit as well. test-block-iothread: bytes argument is unused Great! Now all drivers are prepared to handle 64bit discard requests, or else have explicit max_pdiscard limits. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-Id: <20210903102807.27127-11-vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29block: use int64_t instead of int in driver write_zeroes handlersVladimir Sementsov-Ogievskiy
We are generally moving to int64_t for both offset and bytes parameters on all io paths. Main motivation is realization of 64-bit write_zeroes operation for fast zeroing large disk chunks, up to the whole disk. We chose signed type, to be consistent with off_t (which is signed) and with possibility for signed return type (where negative value means error). So, convert driver write_zeroes handlers bytes parameter to int64_t. The only caller of all updated function is bdrv_co_do_pwrite_zeroes(). bdrv_co_do_pwrite_zeroes() itself is of course OK with widening of callee parameter type. Also, bdrv_co_do_pwrite_zeroes()'s max_write_zeroes is limited to INT_MAX. So, updated functions all are safe, they will not get "bytes" larger than before. Still, let's look through all updated functions, and add assertions to the ones which are actually unprepared to values larger than INT_MAX. For these drivers also set explicit max_pwrite_zeroes limit. Let's go: blkdebug: calculations can't overflow, thanks to bdrv_check_qiov_request() in generic layer. rule_check() and bdrv_co_pwrite_zeroes() both have 64bit argument. blklogwrites: pass to blk_log_writes_co_log() with 64bit argument. blkreplay, copy-on-read, filter-compress: pass to bdrv_co_pwrite_zeroes() which is OK copy-before-write: Calls cbw_do_copy_before_write() and bdrv_co_pwrite_zeroes, both have 64bit argument. file-posix: both handler calls raw_do_pwrite_zeroes, which is updated. In raw_do_pwrite_zeroes() calculations are OK due to bdrv_check_qiov_request(), bytes go to RawPosixAIOData::aio_nbytes which is uint64_t. Check also where that uint64_t gets handed: handle_aiocb_write_zeroes_block() passes a uint64_t[2] to ioctl(BLKZEROOUT), handle_aiocb_write_zeroes() calls do_fallocate() which takes off_t (and we compile to always have 64-bit off_t), as does handle_aiocb_write_zeroes_unmap. All look safe. gluster: bytes go to GlusterAIOCB::size which is int64_t and to glfs_zerofill_async works with off_t. iscsi: Aha, here we deal with iscsi_writesame16_task() that has uint32_t num_blocks argument and iscsi_writesame16_task() has uint16_t argument. Make comments, add assertions and clarify max_pwrite_zeroes calculation. iscsi_allocmap_() functions already has int64_t argument is_byte_request_lun_aligned is simple to update, do it. mirror_top: pass to bdrv_mirror_top_do_write which has uint64_t argument nbd: Aha, here we have protocol limitation, and NBDRequest::len is uint32_t. max_pwrite_zeroes is cleanly set to 32bit value, so we are OK for now. nvme: Again, protocol limitation. And no inherent limit for write-zeroes at all. But from code that calculates cdw12 it's obvious that we do have limit and alignment. Let's clarify it. Also, obviously the code is not prepared to handle bytes=0. Let's handle this case too. trace events already 64bit preallocate: pass to handle_write() and bdrv_co_pwrite_zeroes(), both 64bit. rbd: pass to qemu_rbd_start_co() which is 64bit. qcow2: offset + bytes and alignment still works good (thanks to bdrv_check_qiov_request()), so tail calculation is OK qcow2_subcluster_zeroize() has 64bit argument, should be OK trace events updated qed: qed_co_request wants int nb_sectors. Also in code we have size_t used for request length which may be 32bit. So, let's just keep INT_MAX as a limit (aligning it down to pwrite_zeroes_alignment) and don't care. raw-format: Is OK. raw_adjust_offset and bdrv_co_pwrite_zeroes are both 64bit. throttle: Both throttle_group_co_io_limits_intercept() and bdrv_co_pwrite_zeroes() are 64bit. vmdk: pass to vmdk_pwritev which is 64bit quorum: pass to quorum_co_pwritev() which is 64bit Hooray! At this point all block drivers are prepared to support 64bit write-zero requests, or have explicitly set max_pwrite_zeroes. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-Id: <20210903102807.27127-8-vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> [eblake: use <= rather than < in assertions relying on max_pwrite_zeroes] Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29block: use int64_t instead of uint64_t in driver write handlersVladimir Sementsov-Ogievskiy
We are generally moving to int64_t for both offset and bytes parameters on all io paths. Main motivation is realization of 64-bit write_zeroes operation for fast zeroing large disk chunks, up to the whole disk. We chose signed type, to be consistent with off_t (which is signed) and with possibility for signed return type (where negative value means error). So, convert driver write handlers parameters which are already 64bit to signed type. While being here, convert also flags parameter to be BdrvRequestFlags. Now let's consider all callers. Simple git grep '\->bdrv_\(aio\|co\)_pwritev\(_part\)\?' shows that's there three callers of driver function: bdrv_driver_pwritev() and bdrv_driver_pwritev_compressed() in block/io.c, both pass int64_t, checked by bdrv_check_qiov_request() to be non-negative. qcow2_save_vmstate() does bdrv_check_qiov_request(). Still, the functions may be called directly, not only by drv->... Let's check: git grep '\.bdrv_\(aio\|co\)_pwritev\(_part\)\?\s*=' | \ awk '{print $4}' | sed 's/,//' | sed 's/&//' | sort | uniq | \ while read func; do git grep "$func(" | \ grep -v "$func(BlockDriverState"; done shows several callers: qcow2: qcow2_co_truncate() write at most up to @offset, which is checked in generic qcow2_co_truncate() by bdrv_check_request(). qcow2_co_pwritev_compressed_task() pass the request (or part of the request) that already went through normal write path, so it should be OK qcow: qcow_co_pwritev_compressed() pass int64_t, it's updated by this patch quorum: quorum_co_pwrite_zeroes() pass int64_t and int - OK throttle: throttle_co_pwritev_compressed() pass int64_t, it's updated by this patch vmdk: vmdk_co_pwritev_compressed() pass int64_t, it's updated by this patch Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-Id: <20210903102807.27127-5-vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-29block: use int64_t instead of uint64_t in driver read handlersVladimir Sementsov-Ogievskiy
We are generally moving to int64_t for both offset and bytes parameters on all io paths. Main motivation is realization of 64-bit write_zeroes operation for fast zeroing large disk chunks, up to the whole disk. We chose signed type, to be consistent with off_t (which is signed) and with possibility for signed return type (where negative value means error). So, convert driver read handlers parameters which are already 64bit to signed type. While being here, convert also flags parameter to be BdrvRequestFlags. Now let's consider all callers. Simple git grep '\->bdrv_\(aio\|co\)_preadv\(_part\)\?' shows that's there three callers of driver function: bdrv_driver_preadv() in block/io.c, passes int64_t, checked by bdrv_check_qiov_request() to be non-negative. qcow2_load_vmstate() does bdrv_check_qiov_request(). do_perform_cow_read() has uint64_t argument. And a lot of things in qcow2 driver are uint64_t, so converting it is big job. But we must not work with requests that don't satisfy bdrv_check_qiov_request(), so let's just assert it here. Still, the functions may be called directly, not only by drv->... Let's check: git grep '\.bdrv_\(aio\|co\)_preadv\(_part\)\?\s*=' | \ awk '{print $4}' | sed 's/,//' | sed 's/&//' | sort | uniq | \ while read func; do git grep "$func(" | \ grep -v "$func(BlockDriverState"; done The only one such caller: QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1); ... ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0); in tests/unit/test-bdrv-drain.c, and it's OK obviously. Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com> Message-Id: <20210903102807.27127-4-vsementsov@virtuozzo.com> Reviewed-by: Eric Blake <eblake@redhat.com> [eblake: fix typos] Signed-off-by: Eric Blake <eblake@redhat.com>
2021-09-07block/nvme: Only report VFIO error on failed retryPhilippe Mathieu-Daudé
We expect the first qemu_vfio_dma_map() to fail (indicating DMA mappings exhaustion, see commit 15a730e7a3a). Do not report the first failure as error, since we are going to flush the mappings and retry. This removes spurious error message displayed on the monitor: (qemu) c (qemu) qemu-kvm: VFIO_MAP_DMA failed: No space left on device (qemu) info status VM status: running Reported-by: Tingting Mao <timao@redhat.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20210902070025.197072-12-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2021-09-07util/vfio-helpers: Pass Error handle to qemu_vfio_dma_map()Philippe Mathieu-Daudé
Currently qemu_vfio_dma_map() displays errors on stderr. When using management interface, this information is simply lost. Pass qemu_vfio_dma_map() an Error** handle so it can propagate the error to callers. Reviewed-by: Fam Zheng <fam@euphon.net> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20210902070025.197072-7-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2021-09-07block/nvme: Have nvme_create_queue_pair() report errors consistentlyPhilippe Mathieu-Daudé
nvme_create_queue_pair() does not return a boolean value (indicating eventual error) but a pointer, and is inconsistent in how it fills the error handler. To fulfill callers expectations, always set an error message on failure. Reported-by: Auger Eric <eric.auger@redhat.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20210902070025.197072-6-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2021-07-26block/nvme: Fix VFIO_MAP_DMA failed: No space left on devicePhilippe Mathieu-Daudé
When the NVMe block driver was introduced (see commit bdd6a90a9e5, January 2018), Linux VFIO_IOMMU_MAP_DMA ioctl was only returning -ENOMEM in case of error. The driver was correctly handling the error path to recycle its volatile IOVA mappings. To fix CVE-2019-3882, Linux commit 492855939bdb ("vfio/type1: Limit DMA mappings per container", April 2019) added the -ENOSPC error to signal the user exhausted the DMA mappings available for a container. The block driver started to mis-behave: qemu-system-x86_64: VFIO_MAP_DMA failed: No space left on device (qemu) (qemu) info status VM status: paused (io-error) (qemu) c VFIO_MAP_DMA failed: No space left on device (qemu) c VFIO_MAP_DMA failed: No space left on device (The VM is not resumable from here, hence stuck.) Fix by handling the new -ENOSPC error (when DMA mappings are exhausted) without any distinction to the current -ENOMEM error, so we don't change the behavior on old kernels where the CVE-2019-3882 fix is not present. An easy way to reproduce this bug is to restrict the DMA mapping limit (65535 by default) when loading the VFIO IOMMU module: # modprobe vfio_iommu_type1 dma_entry_limit=666 Cc: qemu-stable@nongnu.org Cc: Fam Zheng <fam@euphon.net> Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Reported-by: Michal Prívozník <mprivozn@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20210723195843.1032825-1-philmd@redhat.com Fixes: bdd6a90a9e5 ("block: Add VFIO based NVMe driver") Buglink: https://bugs.launchpad.net/qemu/+bug/1863333 Resolves: https://gitlab.com/qemu-project/qemu/-/issues/65 Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2021-02-02block/nvme: Trace NVMe spec version supported by the controllerPhilippe Mathieu-Daudé
NVMe controllers implement different versions of the spec, and different features of it. It is useful to gather this information when debugging. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20210127212137.3482291-3-philmd@redhat.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2021-02-02block/nvme: Properly display doorbell stride length in trace eventPhilippe Mathieu-Daudé
Commit 15b2260bef3 ("block/nvme: Trace controller capabilities") misunderstood the doorbell stride value from the datasheet, use the correct one. The 'doorbell_scale' variable used few lines later is correct. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20210127212137.3482291-2-philmd@redhat.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-12-18block/nvme: Implement fake truncate() coroutinePhilippe Mathieu-Daudé
NVMe drive cannot be shrunk. Since commit c80d8b06cfa we can use the @exact parameter (set to false) to return success if the block device is larger than the requested offset (even if we can not be shrunk). Use this parameter to implement the NVMe truncate() coroutine, similarly how it is done for the iscsi and file-posix drivers (see commit 82325ae5f2f "Evaluate @exact in protocol drivers"). Reported-by: Xueqiang Wei <xuwei@redhat.com> Suggested-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20201210125202.858656-1-philmd@redhat.com> Signed-off-by: Max Reitz <mreitz@redhat.com>
2020-11-03block/nvme: Fix nvme_submit_command() on big-endian hostPhilippe Mathieu-Daudé
The Completion Queue Command Identifier is a 16-bit value, so nvme_submit_command() is unlikely to work on big-endian hosts, as the relevant bits are truncated. Fix by using the correct byte-swap function. Fixes: bdd6a90a9e5 ("block: Add VFIO based NVMe driver") Reported-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 20201029093306.1063879-25-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Fix use of write-only doorbells page on Aarch64 archPhilippe Mathieu-Daudé
qemu_vfio_pci_map_bar() calls mmap(), and mmap(2) states: 'offset' must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE). In commit f68453237b9 we started to use an offset of 4K which broke this contract on Aarch64 arch. Fix by mapping at offset 0, and and accessing doorbells at offset=4K. Fixes: f68453237b9 ("block/nvme: Map doorbells pages write-only") Reported-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-24-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Align iov's va and size on host page sizeEric Auger
Make sure iov's va and size are properly aligned on the host page size. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-23-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Change size and alignment of prp_list_pagesEric Auger
In preparation of 64kB host page support, let's change the size and alignment of the prp_list_pages so that the VFIO DMA MAP succeeds with 64kB host page size. We align on the host page size. Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-22-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Change size and alignment of queueEric Auger
In preparation of 64kB host page support, let's change the size and alignment of the queue so that the VFIO DMA MAP succeeds. We align on the host page size. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-21-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Change size and alignment of IDENTIFY response bufferEric Auger
In preparation of 64kB host page support, let's change the size and alignment of the IDENTIFY command response buffer so that the VFIO DMA MAP succeeds. We align on the host page size. Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-20-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Correct minimum device page sizePhilippe Mathieu-Daudé
While trying to simplify the code using a macro, we forgot the 12-bit shift... Correct that. Fixes: fad1eb68862 ("block/nvme: Use register definitions from 'block/nvme.h'") Reported-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-19-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Set request_alignment at initializationPhilippe Mathieu-Daudé
Commit bdd6a90a9e5 ("block: Add VFIO based NVMe driver") sets the request_alignment in nvme_refresh_limits(). For consistency, also set it during initialization. Reported-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-18-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Simplify nvme_cmd_sync()Philippe Mathieu-Daudé
As all commands use the ADMIN queue, it is pointless to pass it as argument each time. Remove the argument, and rename the function as nvme_admin_cmd_sync() to make this new behavior clearer. Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 20201029093306.1063879-17-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Simplify ADMIN queue accessPhilippe Mathieu-Daudé
We don't need to dereference from BDRVNVMeState each time. Use a NVMeQueuePair pointer on the admin queue. The nvme_init() becomes easier to review, matching the style of nvme_add_io_queue(). Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-16-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Correctly initialize Admin Queue AttributesPhilippe Mathieu-Daudé
From the specification chapter 3.1.8 "AQA - Admin Queue Attributes" the Admin Submission Queue Size field is a 0’s based value: Admin Submission Queue Size (ASQS): Defines the size of the Admin Submission Queue in entries. Enabling a controller while this field is cleared to 00h produces undefined results. The minimum size of the Admin Submission Queue is two entries. The maximum size of the Admin Submission Queue is 4096 entries. This is a 0’s based value. This bug has never been hit because the device initialization uses a single command synchronously :) Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-15-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Use definitions instead of magic values in add_io_queue()Philippe Mathieu-Daudé
Replace magic values by definitions, and simplifiy since the number of queues will never reach 64K. Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-14-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Make nvme_init_queue() return boolean indicating errorPhilippe Mathieu-Daudé
Just for consistency, following the example documented since commit e3fe3988d7 ("error: Document Error API usage rules"), return a boolean value indicating an error is set or not. Directly pass errp as the local_err is not requested in our case. This simplifies a bit nvme_create_queue_pair(). Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-12-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Make nvme_identify() return boolean indicating errorPhilippe Mathieu-Daudé
Just for consistency, following the example documented since commit e3fe3988d7 ("error: Document Error API usage rules"), return a boolean value indicating an error is set or not. Directly pass errp as the local_err is not requested in our case. Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Message-id: 20201029093306.1063879-11-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Use unsigned integer for queue counter/sizePhilippe Mathieu-Daudé
We can not have negative queue count/size/index, use unsigned type. Rename 'nr_queues' as 'queue_count' to match the spec naming. Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-10-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Move definitions before structure declarationsPhilippe Mathieu-Daudé
To be able to use some definitions in structure declarations, move them earlier. No logical change. Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-9-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Trace queue pair creation/deletionPhilippe Mathieu-Daudé
Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-8-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Improve nvme_free_req_queue_wait() trace informationPhilippe Mathieu-Daudé
What we want to trace is the block driver state and the queue index. Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-7-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Trace nvme_poll_queue() per queuePhilippe Mathieu-Daudé
As we want to enable multiple queues, report the event in each nvme_poll_queue() call, rather than once in the callback calling nvme_poll_queues(). Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-6-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Trace controller capabilitiesPhilippe Mathieu-Daudé
Controllers have different capabilities and report them in the CAP register. We are particularly interested by the page size limits. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-5-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-11-03block/nvme: Report warning with warn_report()Philippe Mathieu-Daudé
Instead of displaying warning on stderr, use warn_report() which also displays it on the monitor. Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-id: 20201029093306.1063879-4-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com>
2020-10-23block/nvme: Add driver statistics for access alignment and hw errorsPhilippe Mathieu-Daudé
Keep statistics of some hardware errors, and number of aligned/unaligned I/O accesses. QMP example booting a full RHEL 8.3 aarch64 guest: { "execute": "query-blockstats" } { "return": [ { "device": "", "node-name": "drive0", "stats": { "flush_total_time_ns": 6026948, "wr_highest_offset": 3383991230464, "wr_total_time_ns": 807450995, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 3, "wr_bytes": 50133504, "failed_unmap_operations": 0, "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 1846979900, "flush_operations": 130, "wr_operations": 659, "rd_merged": 1192, "rd_bytes": 218244096, "account_failed": false, "idle_time_ns": 2678641497, "rd_operations": 7406, }, "driver-specific": { "driver": "nvme", "completion-errors": 0, "unaligned-accesses": 2959, "aligned-accesses": 4477 }, "qdev": "/machine/peripheral-anon/device[0]/virtio-backend" } ] } Suggested-by: Stefan Hajnoczi <stefanha@gmail.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Acked-by: Markus Armbruster <armbru@redhat.com> Message-id: 20201001162939.1567915-1-philmd@redhat.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2020-10-05block/nvme: Replace magic value by SCALE_MS definitionPhilippe Mathieu-Daudé
Use self-explicit SCALE_MS definition instead of magic value (missed in similar commit e4f310fe7f5). Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20200922083821.578519-7-philmd@redhat.com>
2020-10-05block/nvme: Use register definitions from 'block/nvme.h'Philippe Mathieu-Daudé
Use the NVMe register definitions from "block/nvme.h" which ease a bit reviewing the code while matching the datasheet. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20200922083821.578519-6-philmd@redhat.com>
2020-10-05block/nvme: Drop NVMeRegs structure, directly use NvmeBarPhilippe Mathieu-Daudé
NVMeRegs only contains NvmeBar. Simplify the code by using NvmeBar directly. This triggers a checkpatch.pl error: ERROR: Use of volatile is usually wrong, please add a comment #30: FILE: block/nvme.c:691: + volatile NvmeBar *regs; This is a false positive as in our case we are using I/O registers, so the 'volatile' use is justified. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20200922083821.578519-5-philmd@redhat.com>
2020-10-05block/nvme: Reduce I/O registers scopePhilippe Mathieu-Daudé
We only access the I/O register in nvme_init(). Remove the reference in BDRVNVMeState and reduce its scope. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20200922083821.578519-4-philmd@redhat.com>
2020-10-05block/nvme: Map doorbells pages write-onlyPhilippe Mathieu-Daudé
Per the datasheet sections 3.1.13/3.1.14: "The host should not read the doorbell registers." As we don't need read access, map the doorbells with write-only permission. We keep a reference to this mapped address in the BDRVNVMeState structure. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20200922083821.578519-3-philmd@redhat.com>
2020-10-05util/vfio-helpers: Pass page protections to qemu_vfio_pci_map_bar()Philippe Mathieu-Daudé
Pages are currently mapped READ/WRITE. To be able to use different protections, add a new argument to qemu_vfio_pci_map_bar(). Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Message-Id: <20200922083821.578519-2-philmd@redhat.com>
2020-09-10block/nvme: Pair doorbell registersPhilippe Mathieu-Daudé
For each queue doorbell registers are paired as: - Submission Queue Tail Doorbell - Completion Queue Head Doorbell Reflect that in the NVMeRegs structure, and adapt nvme_create_queue_pair() accordingly. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20200904124130.583838-4-philmd@redhat.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Reviewed-by: Fam Zheng <fam@euphon.net> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-09-10block/nvme: Use generic NvmeBar structurePhilippe Mathieu-Daudé
Commit f3c507adcd7 ("NVMe: Initial commit for new storage interface") introduced the NvmeBar structure. Unfortunately in commit bdd6a90a9e5 ("block: Add VFIO based NVMe driver") we duplicated it. Apparently in commit a3d9a352d48 ("block: Move NVMe constants to a separate header") we tried to unify headers but forgot to remove the structure declared in the block/nvme.c source file. Do it now, and remove the structure size check which is redundant with the header check added in commit 74e18435c0e ("hw/block/nvme: Align I/O BAR to 4 KiB"). Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20200904124130.583838-3-philmd@redhat.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Reviewed-by: Fam Zheng <fam@euphon.net> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-09-10block/nvme: Group controller registers in NVMeRegs structurePhilippe Mathieu-Daudé
We want to use the NvmeBar structure from "block/nvme.h" in the next commit. As a preliminary step, group all the NVMe controller registers in the 'ctrl' field, keeping the doorbells registers out of it. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20200904124130.583838-2-philmd@redhat.com> Reviewed-by: Klaus Jensen <k.jensen@samsung.com> Reviewed-by: Fam Zheng <fam@euphon.net> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-09-07block/nvme: Use an array of EventNotifierPhilippe Mathieu-Daudé
In preparation of using multiple IRQ (thus multiple eventfds) make BDRVNVMeState::irq_notifier an array (for now of a single element, the admin queue notifier). Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20200821195359.1285345-16-philmd@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-09-07block/nvme: Extract nvme_poll_queue()Philippe Mathieu-Daudé
As we want to do per-queue polling, extract the nvme_poll_queue() method which operates on a single queue. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20200821195359.1285345-15-philmd@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-09-07block/nvme: Simplify nvme_create_queue_pair() argumentsPhilippe Mathieu-Daudé
nvme_create_queue_pair() doesn't require BlockDriverState anymore. Replace it by BDRVNVMeState and AioContext to simplify. Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20200821195359.1285345-14-philmd@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
2020-09-07block/nvme: Replace BDRV_POLL_WHILE by AIO_WAIT_WHILEPhilippe Mathieu-Daudé
BDRV_POLL_WHILE() is defined as: #define BDRV_POLL_WHILE(bs, cond) ({ \ BlockDriverState *bs_ = (bs); \ AIO_WAIT_WHILE(bdrv_get_aio_context(bs_), \ cond); }) As we will remove the BlockDriverState use in the next commit, start by using the exploded version of BDRV_POLL_WHILE(). Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Message-Id: <20200821195359.1285345-13-philmd@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>