LU-14003 pcc: rework PCC mmap implementation 92/40092/22
author Qian Yingjin <qian@ddn.com>
Wed, 30 Sep 2020 03:00:43 +0000 (11:00 +0800)
committer Oleg Drokin <green@whamcloud.com>
Mon, 8 Jul 2024 20:08:54 +0000 (20:08 +0000)
The old PCC mmap implementation replaces vm_file with the file of
the PCC copy, calls ->fault() or ->page_mkwrite() on the PCC copy,
and then restores vm_file to the file of the Lustre file.
This design is racy: a mmapped region (vma) can be faulted
concurrently by multiple child threads (each child thread can
clone the VM of the parent process), and nothing makes the
replacement and restoration of vm_file atomic across the call to
->fault() or ->page_mkwrite().
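The interleaving described above can be replayed deterministically in
userspace. This is only a sketch: the toy_* types and old_* helpers are
hypothetical stand-ins for the three steps of the old scheme, not the
removed kernel code, and the two "threads" are simulated by ordering
the steps by hand.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for struct file and struct vm_area_struct. */
struct toy_file { bool is_pcc_copy; };
struct toy_vma { struct toy_file *vm_file; };

/* Step 1 of the old scheme: point the vma at the PCC copy. */
static void old_swap_in(struct toy_vma *vma, struct toy_file *pcc_file)
{
	vma->vm_file = pcc_file;
}

/* Step 2: the backend ->fault() expects vm_file to be the PCC copy. */
static bool old_backend_fault(const struct toy_vma *vma)
{
	return vma->vm_file->is_pcc_copy;
}

/* Step 3: point the vma back at the Lustre file. */
static void old_restore(struct toy_vma *vma, struct toy_file *lustre_file)
{
	vma->vm_file = lustre_file;
}

/* Returns whether thread A's backend fault saw the PCC copy. */
static bool old_scheme_interleaved(void)
{
	struct toy_file lustre = { false }, pcc = { true };
	struct toy_vma vma = { &lustre };
	bool b_ok;

	old_swap_in(&vma, &pcc);		/* A: step 1 */
	old_swap_in(&vma, &pcc);		/* B: step 1 */
	b_ok = old_backend_fault(&vma);		/* B: step 2 */
	assert(b_ok);
	old_restore(&vma, &lustre);		/* B: step 3 */
	return old_backend_fault(&vma);		/* A: step 2, too late */
}
```

With this ordering, thread A's backend fault runs after thread B has
already restored vm_file, so it sees the Lustre file instead of the
PCC copy.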

This patch reworks the mmap() implementation for PCC.
In the new design, PCC mmap replaces the inode mapping of the PCC
copy on the PCC backend filesystem with that of the Lustre file.
This way, the mmapped region (vma) links into the mapping of the
Lustre inode rather than the mapping of the PCC copy.
vm_file keeps pointing to the file handle of the PCC copy until
the PCC cached file is detached or unmapped.
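The mapping swap and its undo can be modelled with a few lines of
userspace C. This is a minimal sketch, assuming simplified stand-in
structures: the toy_* helpers are hypothetical, and the share step is
inferred from the reset path in this patch (pcc_inode_mapping_reset())
rather than copied from the mmap path itself.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct inode;

struct address_space {
	struct inode *host;		/* inode that owns the page cache */
};

struct inode {
	struct address_space *i_mapping;	/* mapping currently in use */
	struct address_space i_data;		/* mapping embedded in the inode */
};

static void toy_inode_init(struct inode *inode)
{
	inode->i_data.host = inode;
	inode->i_mapping = &inode->i_data;
}

/* mmap() time: the PCC copy borrows the mapping of the Lustre inode. */
static void toy_pcc_mapping_share(struct inode *lustre, struct inode *pcc)
{
	pcc->i_mapping = lustre->i_mapping;
	lustre->i_mapping->host = pcc;
}

/* Detach or last unmap: undo the swap. */
static void toy_pcc_mapping_reset(struct inode *lustre, struct inode *pcc)
{
	if (pcc->i_mapping == &pcc->i_data)
		return;			/* the PCC copy was never mmapped */

	assert(pcc->i_mapping == lustre->i_mapping);
	lustre->i_mapping->host = lustre;
	pcc->i_mapping = &pcc->i_data;
}
```

After the share step both inodes reach the same address_space, which is
why a vma created through the PCC file handle ends up linked into the
Lustre mapping.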

LU-14003 pcc: convert mapping pagecache for mmap

In the PCC mmap implementation, mmap() replaces the mapping of
the PCC copy with that of the Lustre file, so that the mmapped
region (vma) links into the mapping of the Lustre file rather
than the mapping of the PCC copy.
In the old design, the pagecache in the original mapping of the
PCC copy is simply dropped at this point, because the mapping of
each page differs after the replacement.

This may hurt mmap performance.
During PCC attach, data is written from Lustre into the PCC copy
in buffered I/O mode. If there is enough system memory, this data
stays in the pagecache, managed by the mapping of the PCC copy,
and a later mmap page fault could read it directly from the
pagecache.
If these pages are dropped because their mapping differs, the
page fault must read the pages from disk, which may result in
poor performance.

To make full use of the pagecache of the PCC copy, the mmap call
can first remove each page from the original mapping of the PCC
copy and then add it to the mapping of the Lustre file. This way,
all cached pages are converted and can be reused by later page
faults.
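The remove-then-add conversion can be sketched in userspace with a
fixed-size array standing in for the page cache. All toy_* names are
hypothetical, and skipping an index that the target already holds is an
assumption of this sketch, not something the commit message specifies.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_SLOTS 8

struct toy_mapping;

/* Stand-in for struct page: records its owning mapping and file index. */
struct toy_page {
	struct toy_mapping *mapping;
	unsigned long index;
};

/* Stand-in for struct address_space: one slot per page index. */
struct toy_mapping {
	struct toy_page *slots[TOY_NR_SLOTS];
};

/*
 * Move every cached page from @from to @to, keeping its index.
 * Returns the number of pages converted.
 */
static int toy_pages_convert(struct toy_mapping *from, struct toy_mapping *to)
{
	int moved = 0;

	for (unsigned long i = 0; i < TOY_NR_SLOTS; i++) {
		struct toy_page *page = from->slots[i];

		if (page == NULL)
			continue;
		if (to->slots[i] != NULL)
			continue;	/* assumption: keep the target's page */

		assert(page->mapping == from && page->index == i);
		from->slots[i] = NULL;	/* remove from the PCC mapping */
		page->mapping = to;	/* convert the page */
		to->slots[i] = page;	/* add to the Lustre mapping */
		moved++;
	}
	return moved;
}
```

The pages themselves are never copied or re-read. Only their owning
mapping changes, which is what lets later page faults hit the cache.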
Was-Change-Id: I1591937543d7d31b8811ec62088accd0070d7d37

EX-8421 llite: disable kernel readahead for pcc mmap

Set ra_pages to 0 for PCC files when mmaped, because
otherwise this setting carries through to Lustre and will
cause crashes and possible inconsistencies.  This happens
because the PCC file and Lustre file share a mapping, which
is a weird trick required to have mmap work on PCC.

Add a set of asserts which confirm kernel readahead is
disabled and wasn't used for mmap.
Was-Change-Id: I117042d68fac25158e8141c243acba698cf1930f

LU-17866 pcc: zero ra_pages explicitly for a file after PCC mmap

To support mmap under PCC, we do some special magic with mmap to
allow Lustre and PCC to share the page mapping.
The mapping host (@mapping->host) for the Lustre file is replaced
with the PCC copy for mmap. This may result in the wrong setting
of @ra_pages for the Lustre file handle with the backing store of
the PCC copy in the kernel:
->do_dentry_open()->file_ra_state_init():
    file_ra_state_init(struct file_ra_state *ra,
                       struct address_space *mapping)
    {
            ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
            ra->prev_pos = -1;
    }

Setting the readahead pages for a file handle is the last step of
the open() call and is outside the control of the Lustre file
system.
Thus, to avoid a wrong @ra_pages value, we explicitly set
@ra_pages to zero for the Lustre file handle in all read I/O
paths.
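The effect of the swapped mapping host on @ra_pages, and the fix, can
be sketched in userspace. The ra_toy_* types and helpers are
hypothetical stand-ins for the kernel structures, and the concrete
readahead values used with them are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for bdi, inode, address_space and file. */
struct ra_toy_bdi { unsigned long ra_pages; };
struct ra_toy_inode { struct ra_toy_bdi *bdi; };
struct ra_toy_mapping { struct ra_toy_inode *host; };
struct ra_toy_ra_state { unsigned long ra_pages; long long prev_pos; };
struct ra_toy_file {
	struct ra_toy_mapping *f_mapping;
	struct ra_toy_ra_state f_ra;
};

/* Mirrors the kernel logic quoted above: ra_pages follows mapping->host. */
static void ra_toy_state_init(struct ra_toy_ra_state *ra,
			      struct ra_toy_mapping *mapping)
{
	assert(mapping->host != NULL);
	ra->ra_pages = mapping->host->bdi->ra_pages;
	ra->prev_pos = -1;
}

/* Last step of open(): the filesystem cannot intercept this. */
static void ra_toy_open(struct ra_toy_file *file,
			struct ra_toy_mapping *mapping)
{
	file->f_mapping = mapping;
	ra_toy_state_init(&file->f_ra, mapping);
}

/* The fix: every read path zeroes ra_pages before doing any I/O. */
static void ra_toy_read_prepare(struct ra_toy_file *file)
{
	file->f_ra.ra_pages = 0;
}
```

If a Lustre file is opened while its mapping host points at the PCC
copy, ra_toy_open() picks up the readahead window of the PCC backend.
Calling ra_toy_read_prepare() at the start of each read path resets it
regardless of when the open happened.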

When invalidating a PCC copy, we switch the mapping back between
Lustre and PCC. We also set mapping->a_ops back to @ll_aops.
The readahead path in the PCC backend may enter ->readpage() in
Lustre. We then check whether the file handle is a Lustre file
handle. If not, the call must come from the mmap readahead I/O
path of the PCC copy, and we return an error code directly.
Was-Change-Id: Id1e4a9e47bb484e97053759e1743fd2fce040149

Test-Parameters: clientcount=3 testlist=sanity-pcc,sanity-pcc,sanity-pcc
Signed-off-by: Qian Yingjin <qian@ddn.com>
Change-Id: Icc5019a691dfb04b5e1fdd580d83915cfe590158
Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/40092
Tested-by: jenkins <devops@whamcloud.com>
Tested-by: Maloo <maloo@whamcloud.com>
Reviewed-by: Andreas Dilger <adilger@whamcloud.com>
Reviewed-by: Li Xi <lixi@ddn.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
13 files changed:
lustre/autoconf/lustre-core.m4
lustre/include/lustre_compat.h
lustre/llite/llite_internal.h
lustre/llite/llite_lib.c
lustre/llite/llite_mmap.c
lustre/llite/pcc.c
lustre/llite/pcc.h
lustre/llite/rw.c
lustre/llite/vvp_io.c
lustre/llite/vvp_object.c
lustre/tests/Makefile.am
lustre/tests/mmap_sanity.c
lustre/tests/sanity-pcc.sh

index 30bec2c..7795adf 100644 (file)
@@ -2488,6 +2488,23 @@ AC_DEFUN([LC_PAGEVEC_INIT_ONE_PARAM], [
 ]) # LC_PAGEVEC_INIT_ONE_PARAM
 
 #
+# LC_PAGEVEC_LOOKUP_THREE_PARAM
+#
+# 4.14 pagevec_lookup takes three parameters
+#
+AC_DEFUN([LC_PAGEVEC_LOOKUP_THREE_PARAM], [
+LB_CHECK_COMPILE([if 'pagevec_lookup' takes three parameter],
+pagevec_lookup, [
+       #include <linux/pagevec.h>
+],[
+       pagevec_lookup(NULL, NULL, NULL);
+],[
+       AC_DEFINE(HAVE_PAGEVEC_LOOKUP_THREE_PARAM, 1,
+               ['pagevec_lookup' takes three parameters])
+])
+]) # LC_PAGEVEC_LOOKUP_THREE_PARAM
+
+#
 # LC_BI_BDEV
 #
 # 4.14 replaced bi_bdev to bi_disk
@@ -3750,6 +3767,30 @@ AC_DEFUN([LC_DQUOT_TRANSFER_WITH_USER_NS], [
 ]) # LC_DQUOT_TRANSFER_WITH_USER_NS
 
 #
+# LC_HAVE_FILEMAP_GET_FOLIOS
+#
+# Linux commit v5.19-rc3-342-gbe0ced5e9cb8
+#  filemap: Add filemap_get_folios()
+#
+AC_DEFUN([LC_SRC_HAVE_FILEMAP_GET_FOLIOS], [
+       LB2_LINUX_TEST_SRC([filemap_get_folios], [
+               #include <linux/pagemap.h>
+       ],[
+               struct address_space *m = NULL;
+               pgoff_t start = 0;
+               struct folio_batch *fbatch = NULL;
+               (void)filemap_get_folios(m, &start, ULONG_MAX, fbatch);
+       ],[-Werror])
+])
+AC_DEFUN([LC_HAVE_FILEMAP_GET_FOLIOS], [
+       AC_MSG_CHECKING([if filemap_get_folios() exists])
+       LB2_LINUX_TEST_RESULT([filemap_get_folios], [
+               AC_DEFINE(HAVE_FILEMAP_GET_FOLIOS, 1,
+                       [filemap_get_folios() exists])
+       ])
+]) # LC_HAVE_FILEMAP_GET_FOLIOS
+
+#
 # LC_HAVE_ADDRESS_SPACE_OPERATIONS_MIGRATE_FOLIO
 #
 # Linux commit v5.19-rc3-392-g5490da4f06d1
@@ -3947,6 +3988,19 @@ AC_DEFUN([LC_NFS_FILLDIR_USE_CTX_RETURN_BOOL], [
 ]) # LC_NFS_FILLDIR_USE_CTX_RETURN_BOOL
 
 #
+# LC_HAVE_ADD_TO_PAGE_CACHE_LOCKED
+#
+# Linux version v6.0 commit: 2bb876b58d593d7f2522ec0f41f20a74fde76822
+# filemap: Remove add_to_page_cache() and add_to_page_cache_locked()
+# add_to_page_cache_locked() no longer exported.
+#
+AC_DEFUN([LC_HAVE_ADD_TO_PAGE_CACHE_LOCKED], [
+LB_CHECK_EXPORT([add_to_page_cache_locked], [mm/filemap.c],
+       [AC_DEFINE(HAVE_ADD_TO_PAGE_CACHE_LOCKED, 1,
+                       [add_to_page_cache_locked is exported by the kernel])])
+]) # LC_HAVE_ADD_TO_PAGE_CACHE_LOCKED
+
+#
 # LC_HAVE_FILEMAP_GET_FOLIOS_CONTIG
 #
 # Linux commit v6.0-rc3-94-g35b471467f88
@@ -4787,6 +4841,7 @@ AC_DEFUN([LC_PROG_LINUX_SRC], [
        LC_SRC_HAVE_ADDRESS_SPACE_OPERATIONS_RELEASE_FOLIO
        LC_SRC_HAVE_LSMCONTEXT_INIT
        LC_SRC_SECURITY_DENTRY_INIT_SECURTY_WITH_CTX
+       LC_SRC_HAVE_FILEMAP_GET_FOLIOS
 
        # 6.0
        LC_SRC_HAVE_NO_LLSEEK
@@ -4796,6 +4851,7 @@ AC_DEFUN([LC_PROG_LINUX_SRC], [
        LC_SRC_HAVE_VFS_SETXATTR_NON_CONST_VALUE
        LC_SRC_HAVE_IOV_ITER_GET_PAGES_ALLOC2
        LC_SRC_HAVE_USER_BACKED_ITER
+       LC_HAVE_ADD_TO_PAGE_CACHE_LOCKED
 
        # 6.1
        LC_SRC_HAVE_GET_RANDOM_U32_AND_U64
@@ -5006,6 +5062,7 @@ AC_DEFUN([LC_PROG_LINUX_RESULTS], [
 
        # 4.14
        LC_PAGEVEC_INIT_ONE_PARAM
+       LC_PAGEVEC_LOOKUP_THREE_PARAM
        LC_BI_BDEV
        LC_INTERVAL_TREE_CACHED
 
@@ -5100,6 +5157,7 @@ AC_DEFUN([LC_PROG_LINUX_RESULTS], [
        LC_HAVE_ADDRESS_SPACE_OPERATIONS_RELEASE_FOLIO
        LC_HAVE_LSMCONTEXT_INIT
        LC_SECURITY_DENTRY_INIT_SECURTY_WITH_CTX
+       LC_HAVE_FILEMAP_GET_FOLIOS
 
        # 6.0
        LC_HAVE_NO_LLSEEK
index f9140a7..2ac6855 100644 (file)
@@ -897,8 +897,10 @@ static inline struct page *ll_read_cache_page(struct address_space *mapping,
 #endif /* HAVE_READ_CACHE_PAGE_WANTS_FILE */
 }
 
-#ifdef HAVE_FOLIO_BATCH
+#if defined(HAVE_FOLIO_BATCH) && defined(HAVE_FILEMAP_GET_FOLIOS)
 # define ll_folio_batch_init(batch, n) folio_batch_init(batch)
+# define ll_filemap_get_folios(m, s, e, fbatch) \
+        filemap_get_folios(m, &s, e, fbatch)
 # define fbatch_at(fbatch, f)          ((fbatch)->folios[(f)])
 # define fbatch_at_npgs(fbatch, f)     folio_nr_pages((fbatch)->folios[(f)])
 # define fbatch_at_pg(fbatch, f, pg)   folio_page((fbatch)->folios[(f)], (pg))
@@ -911,7 +913,7 @@ static inline void folio_batch_reinit(struct folio_batch *fbatch)
 }
 # endif /* HAVE_FOLIO_BATCH_REINIT */
 
-#else /* !HAVE_FOLIO_BATCH */
+#else /* !HAVE_FOLIO_BATCH && !HAVE_FILEMAP_GET_FOLIOS */
 
 # ifdef HAVE_PAGEVEC
 #  define folio_batch                  pagevec
@@ -929,10 +931,17 @@ static inline void folio_batch_reinit(struct folio_batch *fbatch)
 # else
 #  define ll_folio_batch_init(pvec, n) pagevec_init(pvec, n)
 # endif
+#ifdef HAVE_PAGEVEC_LOOKUP_THREE_PARAM
+# define ll_filemap_get_folios(m, s, e, pvec) \
+        pagevec_lookup(pvec, m, &s)
+#else
+# define ll_filemap_get_folios(m, s, e, pvec) \
+        pagevec_lookup(pvec, m, s, PAGEVEC_SIZE)
+#endif
 # define fbatch_at(pvec, n)            ((pvec)->pages[(n)])
 # define fbatch_at_npgs(pvec, n)       1
 # define fbatch_at_pg(pvec, n, pg)     ((pvec)->pages[(n)])
-#endif /* HAVE_FOLIO_BATCH */
+#endif /* HAVE_FOLIO_BATCH && HAVE_FILEMAP_GET_FOLIOS */
 
 #ifndef HAVE_FLUSH___WORKQUEUE
 #define __flush_workqueue(wq)  flush_scheduled_work()
index 382ef1a..f78c4a2 100644 (file)
@@ -279,6 +279,8 @@ struct ll_inode_info {
 
                        struct mutex             lli_pcc_lock;
                        enum lu_pcc_state_flags  lli_pcc_state;
+                       atomic_t                 lli_pcc_mapcnt;
+
                        /*
                         * @lli_pcc_generation saves the gobal PCC generation
                         * when the file was successfully attached into PCC.
@@ -2073,6 +2075,11 @@ static inline struct pcc_super *ll_info2pccs(struct ll_inode_info *lli)
        return ll_i2pccs(ll_info2i(lli));
 }
 
+static inline struct pcc_file *ll_file2pccf(struct file *file)
+{
+       return &((struct ll_file_data *)file->private_data)->fd_pcc_file;
+}
+
 /* crypto.c */
 /* The digested form is made of a FID (16 bytes) followed by the second-to-last
  * ciphertext block (16 bytes), so a total length of 32 bytes.
index 5f380b8..30b7658 100644 (file)
@@ -1300,6 +1300,7 @@ void ll_lli_init(struct ll_inode_info *lli)
                lli->lli_pcc_inode = NULL;
                lli->lli_pcc_dsflags = PCC_DATASET_INVALID;
                lli->lli_pcc_generation = 0;
+               atomic_set(&lli->lli_pcc_mapcnt, 0);
                mutex_init(&lli->lli_group_mutex);
                lli->lli_group_users = 0;
                lli->lli_group_gid = 0;
index 94d330a..ee7ea29 100644 (file)
@@ -398,7 +398,7 @@ static vm_fault_t ll_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 
        result = pcc_fault(vma, vmf, &cached);
        if (cached)
-               goto out;
+               return result;
 
        CDEBUG(D_MMAP|D_IOTRACE,
               "START file %s:"DFID", vma=%p start=%#lx end=%#lx vm_flags=%#lx idx=%lu\n",
@@ -450,7 +450,6 @@ restart:
        }
        sigprocmask(SIG_SETMASK, &old, NULL);
 
-out:
        if (vmf->page && result == VM_FAULT_LOCKED) {
                ll_rw_stats_tally(ll_i2sbi(file_inode(vma->vm_file)),
                                  current->pid, vma->vm_file->private_data,
@@ -494,7 +493,7 @@ static vm_fault_t ll_page_mkwrite(struct vm_area_struct *vma,
 
        result = pcc_page_mkwrite(vma, vmf, &cached);
        if (cached)
-               goto out;
+               return result;
 
        file_update_time(vma->vm_file);
        do {
@@ -531,7 +530,6 @@ static vm_fault_t ll_page_mkwrite(struct vm_area_struct *vma,
                break;
        }
 
-out:
        if (result == VM_FAULT_LOCKED) {
                ll_rw_stats_tally(ll_i2sbi(file_inode(vma->vm_file)),
                                  current->pid, vma->vm_file->private_data,
@@ -556,13 +554,18 @@ out:
  */
 static void ll_vm_open(struct vm_area_struct *vma)
 {
-       struct inode *inode    = file_inode(vma->vm_file);
-       struct vvp_object *vob = cl_inode2vvp(inode);
-
        ENTRY;
-       LASSERT(atomic_read(&vob->vob_mmap_cnt) >= 0);
-       atomic_inc(&vob->vob_mmap_cnt);
-       pcc_vm_open(vma);
+
+       if (vma->vm_private_data == NULL) {
+               struct inode *inode = file_inode(vma->vm_file);
+               struct vvp_object *vob = cl_inode2vvp(inode);
+
+               LASSERT(atomic_read(&vob->vob_mmap_cnt) >= 0);
+               atomic_inc(&vob->vob_mmap_cnt);
+       } else {
+               pcc_vm_open(vma);
+       }
+
        EXIT;
 }
 
@@ -571,13 +574,18 @@ static void ll_vm_open(struct vm_area_struct *vma)
  */
 static void ll_vm_close(struct vm_area_struct *vma)
 {
-       struct inode      *inode = file_inode(vma->vm_file);
-       struct vvp_object *vob   = cl_inode2vvp(inode);
-
        ENTRY;
-       atomic_dec(&vob->vob_mmap_cnt);
-       LASSERT(atomic_read(&vob->vob_mmap_cnt) >= 0);
-       pcc_vm_close(vma);
+
+       if (vma->vm_private_data == NULL) {
+               struct inode *inode = file_inode(vma->vm_file);
+               struct vvp_object *vob = cl_inode2vvp(inode);
+
+               atomic_dec(&vob->vob_mmap_cnt);
+               LASSERT(atomic_read(&vob->vob_mmap_cnt) >= 0);
+       } else {
+               pcc_vm_close(vma);
+       }
+
        EXIT;
 }
 
index c0299b8..eb09814 100644 (file)
@@ -595,6 +595,7 @@ pcc_parse_value_pair(struct pcc_cmd *cmd, char *buffer)
 {
        char *key, *val;
        unsigned long id;
+       bool enable;
        int rc;
 
        val = buffer;
@@ -653,6 +654,18 @@ pcc_parse_value_pair(struct pcc_cmd *cmd, char *buffer)
                        return rc;
                if (id > 0)
                        cmd->u.pccc_add.pccc_flags |= PCC_DATASET_PCCRO;
+       } else if (strcmp(key, "mmap_conv") == 0) {
+               rc = kstrtobool(val, &enable);
+               if (rc)
+                       return rc;
+               if (enable)
+#ifdef HAVE_ADD_TO_PAGE_CACHE_LOCKED
+                       cmd->u.pccc_add.pccc_flags |= PCC_DATASET_MMAP_CONV;
+#else
+                       CWARN("mmap convert is not supported, ignored it.\n");
+#endif
+               else
+                       cmd->u.pccc_add.pccc_flags &= ~PCC_DATASET_MMAP_CONV;
        } else if (strcmp(key, "hsmtool") == 0) {
                cmd->u.pccc_add.pccc_hsmtool_type = hsmtool_string2type(val);
                if (cmd->u.pccc_add.pccc_hsmtool_type != HSMTOOL_POSIX_V1 &&
@@ -1277,8 +1290,13 @@ static void pcc_inode_init(struct pcc_inode *pcci, struct ll_inode_info *lli)
 
 static void pcc_inode_fini(struct pcc_inode *pcci)
 {
+       struct inode *pcc_inode = pcci->pcci_path.dentry->d_inode;
        struct ll_inode_info *lli = pcci->pcci_lli;
 
+       /* The PCC file was once mmaped? */
+       if (pcc_inode && pcc_inode->i_mapping != &pcc_inode->i_data)
+               pcc_inode->i_mapping = &pcc_inode->i_data;
+
        path_put(&pcci->pcci_path);
        pcci->pcci_type = LU_PCC_NONE;
        OBD_SLAB_FREE_PTR(pcci, pcc_inode_slab);
@@ -1707,6 +1725,11 @@ static int pcc_try_readonly_open_attach(struct inode *inode, struct file *file,
                               PFID(&ll_i2info(inode)->lli_fid), rc);
                        /* ignore the error during auto PCC-RO attach. */
                        rc = 0;
+               } else {
+                       CDEBUG(D_CACHE,
+                              "PCC-RO attach %pd "DFID" with size %llu\n",
+                              dentry, PFID(ll_inode2fid(inode)),
+                              i_size_read(inode));
                }
        }
 
@@ -1813,93 +1836,130 @@ static inline bool pcc_may_auto_attach(struct inode *inode,
        RETURN(lli->lli_pcc_dsflags & PCC_DATASET_IO_ATTACH);
 }
 
-int pcc_file_open(struct inode *inode, struct file *file)
+static void __pcc_layout_invalidate(struct pcc_inode *pcci)
 {
-       struct pcc_inode *pcci;
-       struct ll_inode_info *lli = ll_i2info(inode);
-       struct ll_file_data *fd = file->private_data;
-       struct pcc_file *pccf = &fd->fd_pcc_file;
-       struct file *pcc_file;
-       struct path *path;
-       bool cached = false;
-       int rc = 0;
-
-       ENTRY;
-
-       if (!S_ISREG(inode->i_mode))
-               RETURN(0);
+       pcci->pcci_type = LU_PCC_NONE;
+       pcc_layout_gen_set(pcci, CL_LAYOUT_GEN_NONE);
+       if (atomic_read(&pcci->pcci_active_ios) == 0)
+               return;
 
-       if (IS_ENCRYPTED(inode))
-               RETURN(0);
+       CDEBUG(D_CACHE, "Waiting for IO completion: %d\n",
+                      atomic_read(&pcci->pcci_active_ios));
+       wait_event_idle(pcci->pcci_waitq,
+                       atomic_read(&pcci->pcci_active_ios) == 0);
+}
 
+static inline void pcc_inode_mmap_get(struct inode *inode)
+{
        pcc_inode_lock(inode);
-       pcci = ll_i2pcci(inode);
+       atomic_inc(&ll_i2info(inode)->lli_pcc_mapcnt);
+       pcc_inode_unlock(inode);
+}
 
-       if (lli->lli_pcc_state & PCC_STATE_FL_ATTACHING)
-               GOTO(out_unlock, rc = 0);
+static inline void pcc_inode_mapping_reset(struct inode *inode)
+{
+       struct pcc_inode *pcci = ll_i2pcci(inode);
+       struct inode *pcc_inode = pcci->pcci_path.dentry->d_inode;
+       struct address_space *mapping = inode->i_mapping;
+       int rc;
 
-       if (!pcci || !pcc_inode_has_layout(pcci)) {
-               if (pcc_may_auto_attach(inode, PIT_OPEN))
-                       rc = pcc_try_auto_attach(inode, &cached, PIT_OPEN);
+       /* Did we mmap this file? */
+       if (pcc_inode->i_mapping == &pcc_inode->i_data)
+               return;
 
-               if (rc == 0 && !cached)
-                       rc = pcc_try_readonly_open_attach(inode, file, &cached);
+       LASSERT(mapping == pcc_inode->i_mapping && mapping->host == pcc_inode);
 
-               if (rc < 0 || !cached)
-                       GOTO(out_unlock, rc);
+       /*
+        * FIXME: As PCC mmap replaces the inode mapping of the PCC copy on the
+        * PCC backend filesystem with the one of the Lustre file, it may
+        * contain some vmas from the users (i.e. root) directly do mmap on the
+        * file under the PCC backend filesystem. At this time, the mapping may
+        * contain vmas from both Lustre users and users directly performed mmap
+        * on PCC backend filesystem.
+        * Thus, It needs a mechanism to forbid users to access the PCC copy
+        * directly from the user space and the PCC copy can only be accessed
+        * from Lustre PCC hook.
+        * One solution is to use flock() to lock the PCC copy when the file
+        * is once attached into PCC and unlock it when the file is detached
+        * from PCC. By this way, the PCC copy is blocking on access from user
+        * space directly when it is valid cached on PCC.
+        */
 
-               if (!pcci)
-                       pcci = ll_i2pcci(inode);
-       }
+       if (pcc_inode_has_layout(pcci))
+               return;
 
-       pcc_inode_get(pcci);
-       WARN_ON(pccf->pccf_file);
+       /*
+        * The file is detaching, firstly write out all dirty pages and then
+        * unmap and remove all pagecache associated with the PCC backend.
+        */
+       rc = filemap_write_and_wait_range(mapping, 0, LUSTRE_EOF);
+       if (rc)
+               CWARN("%s: Failed to write out data for file fid="DFID"\n",
+                     ll_i2sbi(inode)->ll_fsname, PFID(ll_inode2fid(inode)));
 
-       path = &pcci->pcci_path;
-       CDEBUG(D_CACHE, "opening pcc file '%pd'\n", path->dentry);
+       truncate_pagecache_range(inode, 0, LUSTRE_EOF);
+       mapping->a_ops = &ll_aops;
+       /*
+        * Please note the mapping host (@mapping->host) for the Lustre file is
+        * replaced with the PCC copy in case of mmap() on the PCC cached file.
+        * This may result in the setting of @ra_pages of the Lustre file
+        * handle with the one of the PCC copy wrongly in the kernel:
+        * ->do_dentry_open()->file_ra_state_init()
+        * And this is the last step of the open() call and is not under the
+        * control inside the Lustre file system.
+        * Thus to avoid the setting of @ra_pages wrongly we set @ra_pages with
+        * zero explictly in all read I/O path.
+        */
+       mapping->host = inode;
+       pcc_inode->i_mapping = &pcc_inode->i_data;
 
-       pcc_file = dentry_open(path, file->f_flags,
-                              pcc_super_cred(inode->i_sb));
-       if (IS_ERR_OR_NULL(pcc_file)) {
-               rc = pcc_file == NULL ? -EINVAL : PTR_ERR(pcc_file);
-               pcc_inode_put(pcci);
-       } else {
-               pccf->pccf_file = pcc_file;
-               pccf->pccf_type = pcci->pcci_type;
-       }
+       CDEBUG(D_CACHE, "Reset mapping for inode %p fid="DFID" mapping %p\n",
+              inode, PFID(ll_inode2fid(inode)), inode->i_mapping);
+}
 
-out_unlock:
+static inline void pcc_inode_mmap_put(struct inode *inode)
+{
+       pcc_inode_lock(inode);
+       if (atomic_dec_and_test(&ll_i2info(inode)->lli_pcc_mapcnt))
+               pcc_inode_mapping_reset(inode);
        pcc_inode_unlock(inode);
-       RETURN(rc);
 }
 
-void pcc_file_release(struct inode *inode, struct file *file)
+/* Call with inode lock held. */
+static inline void pcc_inode_detach(struct inode *inode)
+{
+       __pcc_layout_invalidate(ll_i2pcci(inode));
+       pcc_inode_mapping_reset(inode);
+}
+
+static inline void pcc_inode_detach_put(struct inode *inode)
+{
+       struct pcc_inode *pcci = ll_i2pcci(inode);
+
+       pcc_inode_detach(inode);
+       LASSERT(pcci != NULL);
+       pcc_inode_put(pcci);
+}
+
+void pcc_layout_invalidate(struct inode *inode)
 {
        struct pcc_inode *pcci;
-       struct ll_file_data *fd = file->private_data;
-       struct pcc_file *pccf;
-       struct path *path;
 
        ENTRY;
 
-       if (!S_ISREG(inode->i_mode) || fd == NULL)
-               RETURN_EXIT;
-
-       pccf = &fd->fd_pcc_file;
        pcc_inode_lock(inode);
-       if (pccf->pccf_file == NULL)
-               goto out;
-
        pcci = ll_i2pcci(inode);
-       LASSERT(pcci);
-       path = &pcci->pcci_path;
-       CDEBUG(D_CACHE, "releasing pcc file \"%pd\"\n", path->dentry);
-       pcc_inode_put(pcci);
-       fput(pccf->pccf_file);
-       pccf->pccf_file = NULL;
-out:
+       if (pcci && pcc_inode_has_layout(pcci)) {
+               LASSERT(atomic_read(&pcci->pcci_refcount) > 0);
+
+               CDEBUG(D_CACHE, "Invalidate "DFID" layout gen %d\n",
+                      PFID(&ll_i2info(inode)->lli_fid), pcci->pcci_layout_gen);
+
+               pcc_inode_detach_put(inode);
+       }
        pcc_inode_unlock(inode);
-       RETURN_EXIT;
+
+       EXIT;
 }
 
 /* Tolerate the IO failure on PCC and fall back to normal Lustre IO path */
@@ -1920,7 +1980,7 @@ static bool pcc_io_tolerate(struct pcc_inode *pcci,
                 */
                if ((iot == PIT_READ || iot == PIT_GETATTR ||
                     iot == PIT_SPLICE_READ) && rc < 0 && rc != -ENOMEM &&
-                    rc != -EAGAIN && rc != -EIOCBQUEUED)
+                   rc != -EAGAIN && rc != -EIOCBQUEUED)
                        return false;
                if (iot == PIT_FAULT && (rc & VM_FAULT_SIGBUS) &&
                    !(rc & VM_FAULT_OOM))
@@ -1930,7 +1990,8 @@ static bool pcc_io_tolerate(struct pcc_inode *pcci,
        return true;
 }
 
-static void pcc_io_init(struct inode *inode, enum pcc_io_type iot, bool *cached)
+static void pcc_io_init(struct inode *inode, enum pcc_io_type iot,
+                       bool *cached)
 {
        struct pcc_inode *pcci;
 
@@ -1939,22 +2000,20 @@ static void pcc_io_init(struct inode *inode, enum pcc_io_type iot, bool *cached)
        if (pcci && pcc_inode_has_layout(pcci)) {
                LASSERT(atomic_read(&pcci->pcci_refcount) > 0);
                if (pcci->pcci_type == LU_PCC_READONLY &&
-                   (iot == PIT_WRITE || iot == PIT_SETATTR ||
-                    iot == PIT_PAGE_MKWRITE)) {
-                       /* Fall back to normal I/O path */
+                   (iot == PIT_WRITE || iot == PIT_SETATTR)) {
+                       /* Detach from PCC. Fall back to normal I/O path */
                        *cached = false;
-                       /* For mmap write, we need to detach the file from
-                        * RO-PCC, release the page got from ->fault(), and
-                        * then retry the memory fault handling (->fault()
-                        * and ->page_mkwrite()).
-                        * These are done in pcc_page_mkwrite();
-                        */
+                       pcc_inode_detach_put(inode);
                } else {
                        atomic_inc(&pcci->pcci_active_ios);
                        *cached = true;
                }
        } else {
                *cached = false;
+               /*
+                * FIXME: Forbid auto PCC attach if the file has still been
+                * mmapped in PCC.
+                */
                if (pcc_may_auto_attach(inode, iot)) {
                        (void) pcc_try_auto_attach(inode, cached, iot);
                        if (*cached) {
@@ -1979,6 +2038,99 @@ static void pcc_io_fini(struct inode *inode, enum pcc_io_type iot,
                wake_up_all(&pcci->pcci_waitq);
 }
 
+int pcc_file_open(struct inode *inode, struct file *file)
+{
+       struct pcc_inode *pcci;
+       struct ll_inode_info *lli = ll_i2info(inode);
+       struct ll_file_data *fd = file->private_data;
+       struct pcc_file *pccf = &fd->fd_pcc_file;
+       struct file *pcc_file;
+       struct path *path;
+       bool cached = false;
+       int rc = 0;
+
+       ENTRY;
+
+       if (!S_ISREG(inode->i_mode))
+               RETURN(0);
+
+       if (IS_ENCRYPTED(inode))
+               RETURN(0);
+
+       pcc_inode_lock(inode);
+       pcci = ll_i2pcci(inode);
+
+       if (lli->lli_pcc_state & PCC_STATE_FL_ATTACHING)
+               GOTO(out_unlock, rc = 0);
+
+       if (!pcci || !pcc_inode_has_layout(pcci)) {
+               if (pcc_may_auto_attach(inode, PIT_OPEN))
+                       rc = pcc_try_auto_attach(inode, &cached, PIT_OPEN);
+
+               if (rc == 0 && !cached)
+                       rc = pcc_try_readonly_open_attach(inode, file, &cached);
+
+               if (rc < 0 || !cached)
+                       GOTO(out_unlock, rc);
+
+               if (!pcci)
+                       pcci = ll_i2pcci(inode);
+       }
+
+       pcc_inode_get(pcci);
+       WARN_ON(pccf->pccf_file);
+
+       path = &pcci->pcci_path;
+       CDEBUG(D_CACHE, "opening pcc file '%pd' - %pd\n",
+              path->dentry, file->f_path.dentry);
+
+       pcc_file = dentry_open(path, file->f_flags,
+                              pcc_super_cred(inode->i_sb));
+       if (IS_ERR_OR_NULL(pcc_file)) {
+               rc = pcc_file == NULL ? -EINVAL : PTR_ERR(pcc_file);
+               pcc_inode_put(pcci);
+       } else {
+               pccf->pccf_file = pcc_file;
+               pccf->pccf_type = pcci->pcci_type;
+       }
+
+out_unlock:
+       pcc_inode_unlock(inode);
+       RETURN(rc);
+}
+
+void pcc_file_release(struct inode *inode, struct file *file)
+{
+       struct pcc_inode *pcci;
+       struct ll_file_data *fd = file->private_data;
+       struct pcc_file *pccf;
+       struct path *path;
+
+       ENTRY;
+
+       if (!S_ISREG(inode->i_mode) || fd == NULL)
+               RETURN_EXIT;
+
+       pccf = &fd->fd_pcc_file;
+       pcc_inode_lock(inode);
+       if (pccf->pccf_file == NULL)
+               goto out;
+
+       pcci = ll_i2pcci(inode);
+       LASSERT(pcci);
+       path = &pcci->pcci_path;
+       CDEBUG(D_CACHE, "releasing pcc file \"%pd\"\n", path->dentry);
+       pcc_inode_put(pcci);
+
+       LASSERT(file_count(pccf->pccf_file) > 0);
+       fput(pccf->pccf_file);
+       pccf->pccf_file = NULL;
+
+out:
+       pcc_inode_unlock(inode);
+       RETURN_EXIT;
+}
+
 static ssize_t
 __pcc_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
@@ -2018,13 +2170,13 @@ ssize_t pcc_file_read_iter(struct kiocb *iocb,
                           struct iov_iter *iter, bool *cached)
 {
        struct file *file = iocb->ki_filp;
-       struct ll_file_data *fd = file->private_data;
-       struct pcc_file *pccf = &fd->fd_pcc_file;
        struct inode *inode = file_inode(file);
+       struct pcc_file *pccf = ll_file2pccf(file);
        ssize_t result;
 
        ENTRY;
 
+       file->f_ra.ra_pages = 0;
        if (pccf->pccf_file == NULL) {
                *cached = false;
                RETURN(0);
@@ -2089,9 +2241,8 @@ ssize_t pcc_file_write_iter(struct kiocb *iocb,
                            struct iov_iter *iter, bool *cached)
 {
        struct file *file = iocb->ki_filp;
-       struct ll_file_data *fd = file->private_data;
-       struct pcc_file *pccf = &fd->fd_pcc_file;
        struct inode *inode = file_inode(file);
+       struct pcc_file *pccf = ll_file2pccf(file);
        ssize_t result;
 
        ENTRY;
@@ -2101,11 +2252,6 @@ ssize_t pcc_file_write_iter(struct kiocb *iocb,
                RETURN(0);
        }
 
-       if (pccf->pccf_type != LU_PCC_READWRITE) {
-               *cached = false;
-               RETURN(-EAGAIN);
-       }
-
        pcc_io_init(inode, PIT_WRITE, cached);
        if (!*cached)
                RETURN(0);
@@ -2235,13 +2381,13 @@ ssize_t pcc_file_splice_read(struct file *in_file, loff_t *ppos,
                             size_t count, unsigned int flags)
 {
        struct inode *inode = file_inode(in_file);
-       struct ll_file_data *fd = in_file->private_data;
-       struct file *pcc_file = fd->fd_pcc_file.pccf_file;
+       struct file *pcc_file = ll_file2pccf(in_file)->pccf_file;
        bool cached = false;
        ssize_t result;
 
        ENTRY;
 
+       in_file->f_ra.ra_pages = 0;
        if (!pcc_file)
                RETURN(default_file_splice_read(in_file, ppos, pipe,
                                                count, flags));
@@ -2262,8 +2408,7 @@ int pcc_fsync(struct file *file, loff_t start, loff_t end,
              int datasync, bool *cached)
 {
        struct inode *inode = file_inode(file);
-       struct ll_file_data *fd = file->private_data;
-       struct pcc_file *pccf = &fd->fd_pcc_file;
+       struct pcc_file *pccf = ll_file2pccf(file);
        struct file *pcc_file = pccf->pccf_file;
        int rc;
 
@@ -2287,7 +2432,7 @@ int pcc_fsync(struct file *file, loff_t start, loff_t end,
         */
        if (pccf->pccf_type == LU_PCC_READONLY) {
                *cached = false;
-               RETURN(-EAGAIN);
+               RETURN(0);
        }
 
        pcc_io_init(inode, PIT_FSYNC, cached);
@@ -2301,35 +2446,257 @@ int pcc_fsync(struct file *file, loff_t start, loff_t end,
        RETURN(rc);
 }
 
+static inline void pcc_vma_file_reset(struct inode *inode,
+                                     struct vm_area_struct *vma)
+{
+       struct pcc_vma *pccv = (struct pcc_vma *)vma->vm_private_data;
+
+       LASSERT(pccv);
+       if (vma->vm_file != pccv->pccv_file) {
+               struct pcc_file *pccf = ll_file2pccf(pccv->pccv_file);
+
+               LASSERT(vma->vm_file == pccf->pccf_file);
+               LASSERT(vma->vm_file->f_mapping == inode->i_mapping);
+               vma->vm_file = pccv->pccv_file;
+
+               get_file(vma->vm_file);
+               fput(pccf->pccf_file);
+
+               CDEBUG(D_CACHE,
+                      DFID" mapcnt %d vm_file %p:%ld lu_file %p:%ld vma %p\n",
+                      PFID(ll_inode2fid(inode)),
+                      atomic_read(&ll_i2info(inode)->lli_pcc_mapcnt),
+                      vma->vm_file, file_count(vma->vm_file), pccv->pccv_file,
+                      file_count(pccv->pccv_file), vma);
+       }
+}
+
+static void pcc_mmap_vma_reset(struct inode *inode, struct vm_area_struct *vma)
+{
+       pcc_inode_lock(inode);
+       pcc_vma_file_reset(inode, vma);
+       pcc_inode_unlock(inode);
+}
+
+static void pcc_mmap_io_init(struct inode *inode, enum pcc_io_type iot,
+                            struct vm_area_struct *vma, bool *cached)
+{
+       struct pcc_vma *pccv = (struct pcc_vma *)vma->vm_private_data;
+       struct pcc_inode *pcci;
+
+       LASSERT(pccv);
+
+       pcc_inode_lock(inode);
+       pcci = ll_i2pcci(inode);
+       if (pcci && pcc_inode_has_layout(pcci)) {
+               LASSERT(atomic_read(&pcci->pcci_refcount) > 0);
+               if (pcci->pcci_type == LU_PCC_READONLY &&
+                   iot == PIT_PAGE_MKWRITE) {
+                       pcc_inode_detach_put(inode);
+                       pcc_vma_file_reset(inode, vma);
+                       *cached = false;
+               } else {
+                       atomic_inc(&pcci->pcci_active_ios);
+                       *cached = true;
+               }
+       } else {
+               *cached = false;
+               pcc_vma_file_reset(inode, vma);
+       }
+       pcc_inode_unlock(inode);
+}
+
+static int pcc_mmap_pages_convert(struct inode *inode,
+                                 struct inode *pcc_inode)
+{
+#ifdef HAVE_ADD_TO_PAGE_CACHE_LOCKED
+       struct folio_batch fbatch;
+       pgoff_t index = 0;
+       unsigned nr;
+       int rc = 0;
+
+       ll_folio_batch_init(&fbatch, 0);
+       for ( ; ; ) {
+               struct page *page;
+               int i;
+
+               nr = ll_filemap_get_folios(pcc_inode->i_mapping,
+                                          index, ~0UL, &fbatch);
+               if (nr == 0)
+                       break;
+
+               for (i = 0; i < nr; i++) {
+#if defined(HAVE_FOLIO_BATCH) && defined(HAVE_FILEMAP_GET_FOLIOS)
+                       page = &fbatch.folios[i]->page;
+#else
+                       page = fbatch.pages[i];
+#endif
+                       lock_page(page);
+                       wait_on_page_writeback(page);
+
+                       /*
+                        * FIXME: Special handling for shadow or DAX entries.
+                        * i.e. the PCC backend FS is using DAX access
+                        * (ext4-dax) for performance reason on the NVMe
+                        * hardware.
+                        */
+                       /* Remove the page from the mapping of the PCC copy. */
+                       cfs_delete_from_page_cache(page);
+                       /* Add the page into the mapping of the Lustre file. */
+                       rc = add_to_page_cache_locked(page, inode->i_mapping,
+                                                     page->index, GFP_KERNEL);
+                       if (rc) {
+                               unlock_page(page);
+                               folio_batch_release(&fbatch);
+                               return rc;
+                       }
+
+                       unlock_page(page);
+               }
+
+               index = page->index + 1;
+               folio_batch_release(&fbatch);
+               cond_resched();
+       }
+
+       return rc;
+#else
+       return 0;
+#endif /* HAVE_ADD_TO_PAGE_CACHE_LOCKED */
+}
+
+static int pcc_mmap_mapping_set(struct inode *inode, struct inode *pcc_inode)
+{
+       struct address_space *mapping = inode->i_mapping;
+       struct pcc_inode *pcci = ll_i2pcci(inode);
+       int rc;
+
+       ENTRY;
+
+       if (pcc_inode->i_mapping == mapping) {
+               LASSERT(mapping->host == pcc_inode);
+               LASSERT(mapping->a_ops == pcc_inode->i_mapping->a_ops);
+               RETURN(0);
+       }
+
+       if (pcc_inode->i_mapping != &pcc_inode->i_data)
+               RETURN(-EBUSY);
+       /*
+        * Write out all dirty pages and drop the pagecache before switching
+        * the mapping from the PCC copy to the Lustre file for PCC mmap().
+        */
+
+       rc = filemap_write_and_wait_range(mapping, 0, LUSTRE_EOF);
+       if (rc)
+               RETURN(rc);
+
+       truncate_inode_pages(mapping, 0);
+
+       /* Wait for all active I/Os on the PCC copy to finish. */
+       wait_event_idle(pcci->pcci_waitq,
+                       atomic_read(&pcci->pcci_active_ios) == 0);
+
+       rc = filemap_write_and_wait_range(pcc_inode->i_mapping, 0, LUSTRE_EOF);
+       if (rc)
+               RETURN(rc);
+
+       if (ll_i2info(inode)->lli_pcc_dsflags & PCC_DATASET_MMAP_CONV) {
+               /*
+                * Move and convert all pagecache on the mapping of the PCC copy
+                * to the Lustre file.
+                */
+               rc = pcc_mmap_pages_convert(inode, pcc_inode);
+               if (rc)
+                       RETURN(rc);
+       } else {
+               /* Drop all pagecache on the PCC copy directly. */
+               truncate_inode_pages(pcc_inode->i_mapping, 0);
+       }
+
+       mapping->a_ops = pcc_inode->i_mapping->a_ops;
+       mapping->host = pcc_inode;
+       pcc_inode->i_mapping = mapping;
+
+       RETURN(rc);
+}
+
 int pcc_file_mmap(struct file *file, struct vm_area_struct *vma,
                  bool *cached)
 {
+       struct file *pcc_file = ll_file2pccf(file)->pccf_file;
        struct inode *inode = file_inode(file);
-       struct ll_file_data *fd = file->private_data;
-       struct file *pcc_file = fd->fd_pcc_file.pccf_file;
        struct pcc_inode *pcci;
        int rc = 0;
 
        ENTRY;
 
-       if (!pcc_file || !file_inode(pcc_file)->i_fop->mmap) {
-               *cached = false;
+       /* With PCC, files are cached in an unusual way: mmap is set up so
+        * that Lustre and PCC share one page mapping. In that case @ra_pages
+        * may wrongly be taken from the backing device of the PCC copy, so
+        * set @ra_pages to zero explicitly here.
+        * Otherwise kernel readahead may occur, which Lustre does not
+        * support.
+        */
+       file->f_ra.ra_pages = 0;
+
+       *cached = false;
+       if (!pcc_file || !file_inode(pcc_file)->i_fop->mmap)
                RETURN(0);
-       }
 
        pcc_inode_lock(inode);
        pcci = ll_i2pcci(inode);
        if (pcci && pcc_inode_has_layout(pcci)) {
+               struct inode *pcc_inode = file_inode(pcc_file);
+               struct pcc_vma *pccv;
+
                LASSERT(atomic_read(&pcci->pcci_refcount) > 1);
                *cached = true;
-               vma->vm_file = pcc_file;
+
+               rc = pcc_mmap_mapping_set(inode, pcc_inode);
+               if (rc)
+                       GOTO(out, rc);
+
+               OBD_ALLOC_PTR(pccv);
+               if (pccv == NULL)
+                       GOTO(out, rc = -ENOMEM);
+
+               pcc_file->f_mapping = file->f_mapping;
+               vma->vm_file = get_file(pcc_file);
                rc = file_inode(pcc_file)->i_fop->mmap(pcc_file, vma);
-               vma->vm_file = file;
+               if (rc || vma->vm_private_data) {
+                       /*
+                        * Check whether vma->vm_private_data is NULL.
+                        * The PCC mmap design uses vm_private_data itself, so
+                        * there is a conflict if the underlying PCC backend
+                        * filesystem also uses this private data pointer.
+                        */
+                       if (vma->vm_private_data)
+                               rc = -EOPNOTSUPP;
+                       /*
+                        * If the ->mmap() call fails, our caller will put the
+                        * Lustre file, so we should drop the reference to the
+                        * PCC file copy that we took.
+                        */
+                       fput(pcc_file);
+                       OBD_FREE_PTR(pccv);
+                       GOTO(out, rc);
+               }
+
                /* Save the vm ops of backend PCC */
-               vma->vm_private_data = (void *)vma->vm_ops;
+               pccv->pccv_vm_ops = vma->vm_ops;
+               pccv->pccv_file = file;
+               atomic_set(&pccv->pccv_refcnt, 0);
+               vma->vm_private_data = pccv;
+
+               CDEBUG(D_CACHE,
+                      DFID" vma %p size %llu len %lu pgoff %lu flags %lx\n",
+                      PFID(ll_inode2fid(inode)), vma, i_size_read(inode),
+                      vma->vm_end - vma->vm_start, vma->vm_pgoff,
+                      vma->vm_flags);
        } else {
                *cached = false;
        }
+out:
        pcc_inode_unlock(inode);
 
        RETURN(rc);
@@ -2337,78 +2704,91 @@ int pcc_file_mmap(struct file *file, struct vm_area_struct *vma,
 
 void pcc_vm_open(struct vm_area_struct *vma)
 {
-       struct pcc_inode *pcci;
-       struct file *file = vma->vm_file;
-       struct inode *inode = file_inode(file);
-       struct ll_file_data *fd = file->private_data;
-       struct file *pcc_file = fd->fd_pcc_file.pccf_file;
-       struct vm_operations_struct *pcc_vm_ops = vma->vm_private_data;
+       struct pcc_vma *pccv = (struct pcc_vma *)vma->vm_private_data;
+       struct vvp_object *vob;
+       struct pcc_file *pccf;
+       struct inode *inode;
 
        ENTRY;
 
-       if (!pcc_file || !pcc_vm_ops || !pcc_vm_ops->open)
+       if (!pccv)
                RETURN_EXIT;
 
-       pcc_inode_lock(inode);
-       pcci = ll_i2pcci(inode);
-       if (pcci && pcc_inode_has_layout(pcci)) {
-               vma->vm_file = pcc_file;
-               pcc_vm_ops->open(vma);
-               vma->vm_file = file;
-       }
-       pcc_inode_unlock(inode);
+       inode = file_inode(pccv->pccv_file);
+       vob = cl_inode2vvp(inode);
+       LASSERT(atomic_read(&vob->vob_mmap_cnt) >= 0);
+       atomic_inc(&vob->vob_mmap_cnt);
+
+       pccf = ll_file2pccf(pccv->pccv_file);
+       atomic_inc(&pccv->pccv_refcnt);
+       if (pccv->pccv_vm_ops->open)
+               pccv->pccv_vm_ops->open(vma);
+
+       pcc_inode_mmap_get(inode);
+
        EXIT;
 }
 
 void pcc_vm_close(struct vm_area_struct *vma)
 {
-       struct file *file = vma->vm_file;
-       struct inode *inode = file_inode(file);
-       struct ll_file_data *fd = file->private_data;
-       struct file *pcc_file = fd->fd_pcc_file.pccf_file;
-       struct vm_operations_struct *pcc_vm_ops = vma->vm_private_data;
+       struct pcc_vma *pccv = (struct pcc_vma *)vma->vm_private_data;
+       struct vvp_object *vob;
+       struct pcc_file *pccf;
+       struct inode *inode;
 
        ENTRY;
 
-       if (!pcc_file || !pcc_vm_ops || !pcc_vm_ops->close)
+       if (!pccv)
                RETURN_EXIT;
 
-       pcc_inode_lock(inode);
-       /* Layout lock maybe revoked here */
-       vma->vm_file = pcc_file;
-       pcc_vm_ops->close(vma);
-       vma->vm_file = file;
-       pcc_inode_unlock(inode);
+       inode = file_inode(pccv->pccv_file);
+       LASSERT(ll_i2info(inode) != NULL);
+       vob = cl_inode2vvp(inode);
+       atomic_dec(&vob->vob_mmap_cnt);
+       LASSERT(atomic_read(&vob->vob_mmap_cnt) >= 0);
+
+       if (pccv->pccv_vm_ops && pccv->pccv_vm_ops->close)
+               pccv->pccv_vm_ops->close(vma);
+
+       pccf = ll_file2pccf(pccv->pccv_file);
+       pcc_inode_mmap_put(inode);
+       if (atomic_dec_and_test(&pccv->pccv_refcnt)) {
+               fput(pccv->pccv_file);
+               CDEBUG(D_CACHE,
+                     "release pccv "DFID" vm_file %p:%ld lu_file %p:%ld\n",
+                      PFID(ll_inode2fid(inode)),
+                      vma->vm_file, file_count(vma->vm_file),
+                      pccv->pccv_file, file_count(pccv->pccv_file));
+               OBD_FREE_PTR(pccv);
+       }
+
        EXIT;
 }
 
 int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
                     bool *cached)
 {
-       struct page *page = vmf->page;
+       struct pcc_vma *pccv = (struct pcc_vma *)vma->vm_private_data;
        struct mm_struct *mm = vma->vm_mm;
-       struct file *file = vma->vm_file;
-       struct inode *inode = file_inode(file);
-       struct ll_file_data *fd = file->private_data;
-       struct file *pcc_file = fd->fd_pcc_file.pccf_file;
-       struct vm_operations_struct *pcc_vm_ops = vma->vm_private_data;
+       struct inode *inode;
        int rc;
 
        ENTRY;
 
-       if (!pcc_file || !pcc_vm_ops) {
+       if (!pccv || !pccv->pccv_vm_ops) {
                *cached = false;
                RETURN(0);
        }
 
-       if (!pcc_vm_ops->page_mkwrite &&
-           page->mapping == pcc_file->f_mapping) {
+       inode = file_inode(pccv->pccv_file);
+       if (!pccv->pccv_vm_ops->page_mkwrite) {
                __u32 flags = PCC_DETACH_FL_UNCACHE;
 
                CDEBUG(D_MMAP,
                       "%s: PCC backend fs not support ->page_mkwrite()\n",
                       ll_i2sbi(inode)->ll_fsname);
                (void) pcc_ioctl_detach(inode, &flags);
+               pcc_mmap_vma_reset(inode, vma);
                mmap_read_unlock(mm);
                *cached = true;
                RETURN(VM_FAULT_RETRY | VM_FAULT_NOPAGE);
@@ -2416,7 +2796,7 @@ int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
        /* Pause to allow for a race with concurrent detach */
        CFS_FAIL_TIMEOUT(OBD_FAIL_LLITE_PCC_MKWRITE_PAUSE, cfs_fail_val);
 
-       pcc_io_init(inode, PIT_PAGE_MKWRITE, cached);
+       pcc_mmap_io_init(inode, PIT_PAGE_MKWRITE, vma, cached);
        if (!*cached) {
                /* This happens when the file is detached from PCC after got
                 * the fault page via ->fault() on the inode of the PCC copy.
@@ -2434,16 +2814,14 @@ int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
                 * VM_FAULT_NOPAGE | VM_FAULT_RETRY to the caller
                 * __do_page_fault and retry the memory fault handling.
                 */
-               if (page->mapping == pcc_file->f_mapping) {
-                       __u32 flags = PCC_DETACH_FL_UNCACHE;
 
-                       pcc_ioctl_detach(inode, &flags);
-                       *cached = true;
-                       mmap_read_unlock(mm);
-                       RETURN(VM_FAULT_RETRY | VM_FAULT_NOPAGE);
-               }
+               LASSERT(vma->vm_file == pccv->pccv_file);
+               if (vmf->page->mapping == &inode->i_data)
+                       RETURN(0);
 
-               RETURN(0);
+               *cached = true;
+               mmap_read_unlock(mm);
+               RETURN(VM_FAULT_RETRY | VM_FAULT_NOPAGE);
        }
 
        /*
@@ -2453,13 +2831,11 @@ int pcc_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
        if (CFS_FAIL_CHECK(OBD_FAIL_LLITE_PCC_DETACH_MKWRITE))
                GOTO(out, rc = VM_FAULT_SIGBUS);
 
-       vma->vm_file = pcc_file;
 #ifdef HAVE_VM_OPS_USE_VM_FAULT_ONLY
-       rc = pcc_vm_ops->page_mkwrite(vmf);
+       rc = pccv->pccv_vm_ops->page_mkwrite(vmf);
 #else
-       rc = pcc_vm_ops->page_mkwrite(vma, vmf);
+       rc = pccv->pccv_vm_ops->page_mkwrite(vma, vmf);
 #endif
-       vma->vm_file = file;
 
 out:
        pcc_io_fini(inode, PIT_PAGE_MKWRITE, rc, cached);
@@ -2472,6 +2848,7 @@ out:
                __u32 flags = PCC_DETACH_FL_UNCACHE;
 
                (void) pcc_ioctl_detach(inode, &flags);
+               pcc_mmap_vma_reset(inode, vma);
                mmap_read_unlock(mm);
                RETURN(VM_FAULT_RETRY | VM_FAULT_NOPAGE);
        }
@@ -2481,26 +2858,24 @@ out:
 int pcc_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
              bool *cached)
 {
-       struct file *file = vma->vm_file;
-       struct inode *inode = file_inode(file);
-       struct ll_file_data *fd = file->private_data;
-       struct file *pcc_file = fd->fd_pcc_file.pccf_file;
-       struct vm_operations_struct *pcc_vm_ops = vma->vm_private_data;
+       struct pcc_vma *pccv = (struct pcc_vma *)vma->vm_private_data;
+       struct inode *inode;
        int rc;
 
        ENTRY;
 
-       if (!pcc_file || !pcc_vm_ops || !pcc_vm_ops->fault) {
+       if (!pccv) {
                *cached = false;
                RETURN(0);
        }
 
+       inode = file_inode(pccv->pccv_file);
        if (!S_ISREG(inode->i_mode)) {
                *cached = false;
                RETURN(0);
        }
 
-       pcc_io_init(inode, PIT_FAULT, cached);
+       pcc_mmap_io_init(inode, PIT_FAULT, vma, cached);
        if (!*cached)
                RETURN(0);
 
@@ -2508,51 +2883,25 @@ int pcc_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
        if (CFS_FAIL_CHECK(OBD_FAIL_LLITE_PCC_FAKE_ERROR))
                GOTO(out, rc = VM_FAULT_SIGBUS);
 
-       vma->vm_file = pcc_file;
 #ifdef HAVE_VM_OPS_USE_VM_FAULT_ONLY
-       rc = pcc_vm_ops->fault(vmf);
+       rc = pccv->pccv_vm_ops->fault(vmf);
 #else
-       rc = pcc_vm_ops->fault(vma, vmf);
+       rc = pccv->pccv_vm_ops->fault(vma, vmf);
 #endif
-       vma->vm_file = file;
+
 out:
        pcc_io_fini(inode, PIT_FAULT, rc, cached);
-       RETURN(rc);
-}
-
-static void __pcc_layout_invalidate(struct pcc_inode *pcci)
-{
-       pcci->pcci_type = LU_PCC_NONE;
-       pcc_layout_gen_set(pcci, CL_LAYOUT_GEN_NONE);
-       if (atomic_read(&pcci->pcci_active_ios) == 0)
-               return;
 
-       CDEBUG(D_CACHE, "Waiting for IO completion: %d\n",
-                      atomic_read(&pcci->pcci_active_ios));
-       wait_event_idle(pcci->pcci_waitq,
-                       atomic_read(&pcci->pcci_active_ios) == 0);
-}
-
-void pcc_layout_invalidate(struct inode *inode)
-{
-       struct pcc_inode *pcci;
-
-       ENTRY;
-
-       pcc_inode_lock(inode);
-       pcci = ll_i2pcci(inode);
-       if (pcci && pcc_inode_has_layout(pcci)) {
-               LASSERT(atomic_read(&pcci->pcci_refcount) > 0);
-               __pcc_layout_invalidate(pcci);
-
-               CDEBUG(D_CACHE, "Invalidate "DFID" layout gen %d\n",
-                      PFID(&ll_i2info(inode)->lli_fid), pcci->pcci_layout_gen);
+       if ((rc & VM_FAULT_SIGBUS) && !(rc & VM_FAULT_OOM)) {
+               __u32 flags = PCC_DETACH_FL_UNCACHE;
 
-               pcc_inode_put(pcci);
+               CDEBUG(D_CACHE, "PCC fault failed: fid = "DFID" rc = %d\n",
+                      PFID(ll_inode2fid(inode)), rc);
+               (void) pcc_ioctl_detach(inode, &flags);
+               pcc_mmap_vma_reset(inode, vma);
        }
-       pcc_inode_unlock(inode);
 
-       EXIT;
+       RETURN(rc);
 }
 
 static int pcc_inode_remove(struct inode *inode, struct dentry *pcc_dentry)
@@ -3332,10 +3681,9 @@ int pcc_ioctl_detach(struct inode *inode, __u32 *flags)
                        lli->lli_pcc_dsflags = PCC_DATASET_NONE;
                }
 
-               __pcc_layout_invalidate(pcci);
-               pcc_inode_put(pcci);
+               pcc_inode_detach_put(inode);
        } else if (pcci->pcci_type == LU_PCC_READONLY) {
-               __pcc_layout_invalidate(pcci);
+               pcc_inode_detach(inode);
 
                if (*flags & PCC_DETACH_FL_UNCACHE && !pcci->pcci_unlinked) {
                        old_cred =  override_creds(pcc_super_cred(inode->i_sb));
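The fault-path decision that `pcc_mmap_io_init()` makes in the pcc.c changes above can be modelled in user space. This is a minimal sketch rather than the kernel code: every `model_*` name is hypothetical, the inode lock is omitted, and the detach and vma reset are reduced to flags. It also assumes that a detach invalidates the cached layout, as the removed `__pcc_layout_invalidate()` did.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; names are illustrative. */
enum model_pcc_type { MODEL_PCC_NONE, MODEL_PCC_READWRITE, MODEL_PCC_READONLY };
enum model_io_type { MODEL_PIT_FAULT, MODEL_PIT_PAGE_MKWRITE };

struct model_pcc_inode {
	enum model_pcc_type type;
	bool has_layout;	/* models pcc_inode_has_layout() */
	int active_ios;		/* models pcci_active_ios */
};

struct model_result {
	bool cached;		/* serve the fault from the PCC copy */
	bool detached;		/* models pcc_inode_detach_put() */
	bool vma_reset;		/* models pcc_vma_file_reset() */
};

static struct model_result model_mmap_io_init(struct model_pcc_inode *pcci,
					      enum model_io_type iot)
{
	struct model_result res = { false, false, false };

	if (pcci != NULL && pcci->has_layout) {
		assert(pcci->active_ios >= 0);
		if (pcci->type == MODEL_PCC_READONLY &&
		    iot == MODEL_PIT_PAGE_MKWRITE) {
			/* Write fault on a read-only cache: detach and hand
			 * the vma back to the Lustre file.
			 * Assumption: detach invalidates the layout.
			 */
			pcci->has_layout = false;
			pcci->type = MODEL_PCC_NONE;
			res.detached = true;
			res.vma_reset = true;
		} else {
			/* Cached I/O: account for it as active. */
			pcci->active_ios++;
			res.cached = true;
		}
	} else {
		/* No valid PCC state: fall back to the Lustre file. */
		res.vma_reset = true;
	}

	return res;
}
```

In this model a read fault on a read-only cache stays cached, while the first write fault detaches the cache, so every later fault takes the fallback branch.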
index b059004..73be3fe 100644 (file)
@@ -135,6 +135,8 @@ enum pcc_dataset_flags {
        PCC_DATASET_PCC_ALL     = PCC_DATASET_PCCRW | PCC_DATASET_PCCRO,
        /* Default PCC caching mode: PCC-RO mode */
        PCC_DATASET_PCC_DEFAULT = PCC_DATASET_PCCRO,
+       /* Move pagecache from mapping of PCC copy to Lustre file for mmap */
+       PCC_DATASET_MMAP_CONV   = 0x40,
 };
 
 struct pcc_dataset {
@@ -197,6 +199,12 @@ struct pcc_file {
        enum lu_pcc_type         pccf_type;
 };
 
+struct pcc_vma {
+       atomic_t                                 pccv_refcnt;
+       struct file                             *pccv_file;
+       const struct vm_operations_struct       *pccv_vm_ops;
+};
+
 enum pcc_io_type {
        /* read system call */
        PIT_READ = 1,
@@ -296,4 +304,18 @@ struct pcc_dataset *pcc_dataset_match_get(struct pcc_super *super,
 void pcc_dataset_put(struct pcc_dataset *dataset);
 void pcc_inode_free(struct inode *inode);
 void pcc_layout_invalidate(struct inode *inode);
+
+static inline struct file *pcc_vma_file(struct vm_area_struct *vma)
+{
+       struct pcc_vma *pccv = (struct pcc_vma *)vma->vm_private_data;
+       struct file *file;
+
+       if (pccv)
+               file = pccv->pccv_file;
+       else
+               file = vma->vm_file;
+
+       return file;
+}
+
 #endif /* LLITE_PCC_H */
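The `struct pcc_vma` bookkeeping added to this header can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the Lustre implementation: all `model_*` names are hypothetical, C11 atomics stand in for the kernel `atomic_t`, and the caller is assumed to invoke the open hook once after a successful attach, because the count starts at zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-ins; only the fields used here exist. */
struct model_file { int id; };

struct model_pcc_vma {
	atomic_int refcnt;		/* models pccv_refcnt */
	struct model_file *lu_file;	/* models pccv_file (Lustre file) */
};

struct model_vma {
	struct model_file *vm_file;
	struct model_pcc_vma *priv;	/* models vm_private_data */
};

/* Same fallback as pcc_vma_file(): prefer the saved Lustre file. */
static struct model_file *model_vma_file(const struct model_vma *vma)
{
	return vma->priv ? vma->priv->lu_file : vma->vm_file;
}

/* Models the setup in pcc_file_mmap(): the count starts at zero. */
static int model_vma_attach(struct model_vma *vma, struct model_file *lu_file,
			    struct model_file *pcc_file)
{
	struct model_pcc_vma *pccv = malloc(sizeof(*pccv));

	if (pccv == NULL)
		return -1;
	atomic_init(&pccv->refcnt, 0);
	pccv->lu_file = lu_file;
	vma->vm_file = pcc_file;
	vma->priv = pccv;
	return 0;
}

static void model_vm_open(struct model_vma *vma)
{
	if (vma->priv)
		atomic_fetch_add(&vma->priv->refcnt, 1);
}

/* Returns true when this close released the private data. */
static bool model_vm_close(struct model_vma *vma)
{
	struct model_pcc_vma *pccv = vma->priv;

	if (pccv == NULL)
		return false;
	/* atomic_fetch_sub() returns the old value; 1 means we hit zero. */
	if (atomic_fetch_sub(&pccv->refcnt, 1) == 1) {
		free(pccv);
		vma->priv = NULL;
		return true;
	}
	return false;
}
```

A vma copied by value shares the same private pointer, which is how this model represents a vma duplicated on fork or split: only the last close frees the data.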
index 04d5add..0ea0358 100644 (file)
@@ -1914,10 +1914,12 @@ int ll_readpage(struct file *file, struct page *vmpage)
        struct inode *inode = file_inode(file);
        struct cl_object *clob = ll_i2info(inode)->lli_clob;
        struct ll_sb_info *sbi = ll_i2sbi(inode);
+       struct super_block *sb = inode->i_sb;
        const struct lu_env *env = NULL;
        struct cl_read_ahead ra = { 0 };
        struct ll_cl_context *lcc;
        struct cl_io *io = NULL;
+       bool ra_assert = false;
        struct cl_page *page;
        struct vvp_io *vio;
        int result;
@@ -1932,6 +1934,20 @@ int ll_readpage(struct file *file, struct page *vmpage)
        }
 
        /*
+        * This is not a Lustre file handle; it should be a file handle of
+        * the PCC copy.
+        * The request comes from the PCC mmap readahead I/O path, and the
+        * PCC copy has been invalidated.
+        * Return the error code directly here, because the request comes
+        * from the readahead I/O path for the PCC copy.
+       if (inode->i_op != &ll_file_inode_operations) {
+               CERROR("%s: readpage() on invalidated PCC inode %lu: rc=%d\n",
+                      sb->s_id, inode->i_ino, -EIO);
+               unlock_page(vmpage);
+               RETURN(-EIO);
+       }
+
+       /*
         * The @vmpage got truncated.
         * This is a kernel bug introduced since kernel 5.12:
         * comment: cbd59c48ae2bcadc4a7599c29cf32fd3f9b78251
@@ -2070,6 +2086,36 @@ int ll_readpage(struct file *file, struct page *vmpage)
                }
        }
 
+       /* This is a sequence of checks verifying that kernel readahead is
+        * truly disabled
+        */
+       if (lcc && lcc->lcc_type == LCC_MMAP) {
+               if (io->u.ci_fault.ft_index != vmpage->index) {
+                       CERROR("%s: ft_index %lu, vmpage index %lu\n",
+                              sbi->ll_fsname, io->u.ci_fault.ft_index,
+                              vmpage->index);
+                       ra_assert = true;
+               }
+       }
+
+       if (ra_assert || sb->s_bdi->ra_pages != 0 || file->f_ra.ra_pages != 0) {
+               CERROR("%s: bdi ra_pages %lu, file ra_pages %u\n",
+                      sbi->ll_fsname, sb->s_bdi->ra_pages,
+                      file->f_ra.ra_pages);
+               ra_assert = true;
+       }
+
+
+#ifdef HAVE_BDI_IO_PAGES
+       if (ra_assert || sb->s_bdi->io_pages != 0) {
+               CERROR("%s: bdi io_pages %lu\n",
+                      sbi->ll_fsname, sb->s_bdi->io_pages);
+               ra_assert = true;
+       }
+#endif
+       if (ra_assert)
+               LASSERT(!ra_assert);
+
        vio = vvp_env_io(env);
        /*
         * Direct read can fall back to buffered read, but DIO is done
index aea4e3f..78ca667 100644 (file)
@@ -472,11 +472,12 @@ static int vvp_mmap_locks(const struct lu_env *env,
 
                mmap_read_lock(mm);
                while ((vma = our_vma(mm, addr, bytes)) != NULL) {
-                       struct dentry *de = file_dentry(vma->vm_file);
+                       struct file *file = pcc_vma_file(vma);
+                       struct dentry *de = file_dentry(file);
                        struct inode *inode = de->d_inode;
                        int flags = CEF_MUST;
 
-                       if (ll_file_nolock(vma->vm_file)) {
+                       if (ll_file_nolock(file)) {
                                /* For no lock case is not allowed for mmap */
                                result = -EINVAL;
                                break;
@@ -1033,7 +1034,8 @@ static void vvp_set_batch_dirty(struct folio_batch *fbatch)
        struct page *page = fbatch_at_pg(fbatch, 0, 0);
        int count = folio_batch_count(fbatch);
        int i;
-#if !defined(HAVE_FOLIO_BATCH) || defined(HAVE_KALLSYMS_LOOKUP_NAME)
+#if !defined(HAVE_FOLIO_BATCH) || !defined(HAVE_FILEMAP_GET_FOLIOS) || \
+       defined(HAVE_KALLSYMS_LOOKUP_NAME)
        int pg, npgs;
 #endif
 #ifdef HAVE_KALLSYMS_LOOKUP_NAME
@@ -1060,7 +1062,7 @@ static void vvp_set_batch_dirty(struct folio_batch *fbatch)
 #ifndef HAVE_ACCOUNT_PAGE_DIRTIED_EXPORT
        if (!vvp_account_page_dirtied) {
                for (i = 0; i < count; i++) {
-#ifdef HAVE_FOLIO_BATCH
+#if defined(HAVE_FOLIO_BATCH) && defined(HAVE_FILEMAP_GET_FOLIOS)
                        filemap_dirty_folio(page->mapping, fbatch->folios[i]);
 #else
                        npgs = fbatch_at_npgs(fbatch, i);
index b09ba1f..4dcd485 100644 (file)
@@ -135,6 +135,7 @@ static int vvp_conf_set(const struct lu_env *env, struct cl_object *obj,
 
                ll_layout_version_set(lli, CL_LAYOUT_GEN_NONE);
 
+               pcc_layout_invalidate(conf->coc_inode);
                /* Clean up page mmap for this inode.
                 * The reason for us to do this is that if the page has
                 * already been installed into memory space, the process
@@ -146,7 +147,6 @@ static int vvp_conf_set(const struct lu_env *env, struct cl_object *obj,
                 */
                unmap_mapping_range(conf->coc_inode->i_mapping,
                                    0, OBD_OBJECT_EOF, 0);
-               pcc_layout_invalidate(conf->coc_inode);
        }
        return 0;
 }
index 7f33dfa..e177ac1 100644 (file)
@@ -108,7 +108,7 @@ if NO_STRINGOP_OVERFLOW
 badarea_io_CFLAGS=-Wno-stringop-overflow
 endif # NO_STRINGOP_OVERFLOW
 
-mmap_sanity_LDADD = $(LIBLUSTREAPI)
+mmap_sanity_LDADD = $(LIBLUSTREAPI) $(PTHREAD_LIBS)
 multiop_LDADD = $(LIBLUSTREAPI) $(PTHREAD_LIBS)
 llapi_layout_test_LDADD = $(LIBLUSTREAPI)
 llapi_hsm_test_LDADD = $(LIBLUSTREAPI)
index f3cc90f..e06fd80 100644 (file)
@@ -37,6 +37,7 @@
 #include <limits.h>
 #include <stdio.h>
 #include <unistd.h>
+#include <pthread.h>
 #include <stdlib.h>
 #include <fcntl.h>
 #include <getopt.h>
@@ -367,7 +368,9 @@ static int mmap_tst4(char *mnt)
        if (rc)
                goto out_unmap;
 
-       memset(ptr, '1', region);
+       memset(ptr, '1', region / 2);
+       sleep(2);
+       memset(ptr + region / 2, '1', region / 2);
 
        rc = write(fdw, ptr, region);
        if (rc <= 0) {
@@ -419,7 +422,9 @@ static int remote_tst4(char *mnt)
                goto out_close;
        }
 
-       memset(ptr, '2', region);
+       memset(ptr, '2', region / 2);
+       sleep(2);
+       memset(ptr + region / 2, '2', region / 2);
 
        rc = write(fdw, ptr, region);
        if (rc <= 0) {
@@ -806,8 +811,193 @@ out:
        return rc;
 }
 
+#define NUM_THREADS    8
+#define BUF_SIZE       4096
+#define MAX_LEN                1048576
+#define MAX_FILESIZE   (NUM_THREADS * MAX_LEN)
+
+struct thread_data {
+       int      td_tid;
+       char    *td_buf;
+       size_t   td_len;
+       int      td_fd;
+};
+
+static int mmap_create_file(char *fname, size_t size)
+{
+       char buf[BUF_SIZE];
+       ssize_t written = 0;
+       int rc = 0;
+       int fd;
+
+       fd = open(fname, O_WRONLY | O_CREAT, 0666);
+       if (fd == -1) {
+               perror("open");
+               return -errno;
+       }
+
+       memset(buf, 'Q', sizeof(buf));
+       while (written < size) {
+               ssize_t ret;
+
+               ret = write(fd, buf, BUF_SIZE);
+               if (ret != BUF_SIZE) {
+                       rc = -errno;
+                       fprintf(stderr, "failed to write %s: %s\n",
+                               fname, strerror(-rc));
+                       goto out;
+               }
+
+               written += ret;
+       }
+
+out:
+       close(fd);
+       return rc;
+}
+
+static void *tst10_thread(void *arg)
+{
+       struct thread_data *data = (struct thread_data *)arg;
+       size_t size = data->td_len;
+       int fd = data->td_fd;
+       char buf[BUF_SIZE];
+       loff_t offset = 0;
+       char *ptr;
+       int rc;
+
+       ptr = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
+       if (ptr == MAP_FAILED) {
+               rc = errno;
+               perror("mmap()");
+               exit(rc);
+       }
+
+       do {
+               memcpy(buf, ptr + offset, sizeof(buf));
+               offset += sizeof(buf);
+       } while (offset < size);
+
+       rc = munmap(ptr, size);
+       if (rc == -1) {
+               rc = errno;
+               perror("munmap");
+               exit(rc);
+       }
+
+       return NULL;
+}
+
 static int mmap_tst10(char *mnt)
 {
+       pthread_t thread[NUM_THREADS];
+       struct thread_data data;
+       char fname[PATH_MAX];
+       int rc = 0;
+       int fd;
+       int i;
+
+       if (snprintf(fname, PATH_MAX, "%s/mmap_tst10", mnt) >= PATH_MAX) {
+               fprintf(stderr, "file name too long\n");
+               return -ENAMETOOLONG;
+       }
+
+       rc = mmap_create_file(fname, MAX_FILESIZE);
+       if (rc)
+               return rc;
+
+       fd = open(fname, O_RDONLY | O_DIRECT, 0644);
+       if (fd == -1) {
+               rc = -errno;
+               perror("open");
+               return rc;
+       }
+
+       data.td_fd = fd;
+       data.td_len = MAX_FILESIZE;
+       for (i = 0; i < NUM_THREADS; i++)
+               pthread_create(&thread[i], NULL, tst10_thread, &data);
+
+       for (i = 0; i < NUM_THREADS; i++)
+               pthread_join(thread[i], NULL);
+
+       close(fd);
+       unlink(fname);
+       return rc;
+}
+
+static void *tst11_thread(void *arg)
+{
+       struct thread_data *data = (struct thread_data *)arg;
+       char *ptr = data->td_buf;
+       char buf[BUF_SIZE];
+       loff_t offset = 0;
+
+       do {
+               memcpy(buf, ptr + offset, sizeof(buf));
+               offset += sizeof(buf);
+       } while (offset < data->td_len);
+
+       return NULL;
+}
+
+static int mmap_tst11(char *mnt)
+{
+       pthread_t thread[NUM_THREADS];
+       struct thread_data data[NUM_THREADS];
+       char fname[PATH_MAX];
+       char *ptr;
+       int rc = 0;
+       int fd;
+       int i;
+
+       if (snprintf(fname, PATH_MAX, "%s/mmap_tst11", mnt) >= PATH_MAX) {
+               fprintf(stderr, "file name too long\n");
+               return -ENAMETOOLONG;
+       }
+
+       rc = mmap_create_file(fname, MAX_FILESIZE);
+       if (rc)
+               return rc;
+
+       fd = open(fname, O_RDONLY | O_DIRECT, 0644);
+       if (fd == -1) {
+               rc = -errno;
+               perror("open");
+               return rc;
+       }
+
+       ptr = mmap(NULL, MAX_FILESIZE, PROT_READ, MAP_PRIVATE, fd, 0);
+       if (ptr == MAP_FAILED) {
+               rc = errno;
+               perror("mmap()");
+               exit(rc);
+       }
+
+       for (i = 0; i < NUM_THREADS; i++) {
+               data[i].td_tid = i;
+               data[i].td_len = MAX_LEN;
+               data[i].td_buf = ptr + i * MAX_LEN;
+               pthread_create(&thread[i], NULL, tst11_thread, &data[i]);
+       }
+
+       for (i = 0; i < NUM_THREADS; i++)
+               pthread_join(thread[i], NULL);
+
+       rc = munmap(ptr, MAX_FILESIZE);
+       if (rc == -1) {
+               rc = errno;
+               perror("munmap");
+               exit(rc);
+       }
+
+       close(fd);
+       unlink(fname);
+       return rc;
+}
+
+static int mmap_tst12(char *mnt)
+{
        char *buf = MAP_FAILED;
        char *buffer[256];
        struct stat st1, st2;
@@ -951,11 +1141,23 @@ struct test_case tests[] = {
        },
        {
                .tc             = 10,
-               .desc           = "mmap test10: mtime not change for readonly mmap access",
+               .desc           = "mmap test10: multi-thread mmap access",
                .test_fn        = mmap_tst10,
                .node_cnt       = 1
        },
        {
+               .tc             = 11,
+               .desc           = "mmap test11: multi-thread shared mmap",
+               .test_fn        = mmap_tst11,
+               .node_cnt       = 1
+       },
+       {
+               .tc             = 12,
+               .desc           = "mmap test12: mtime not change for readonly mmap access",
+               .test_fn        = mmap_tst12,
+               .node_cnt       = 1
+       },
+       {
                .tc             = 0
        }
 };
index 1425cfc..2bcdd5f 100755 (executable)
@@ -57,7 +57,7 @@ fi
 if [[ -r /etc/redhat-release ]]; then
        rhel_version=$(sed -e 's/[^0-9.]*//g' /etc/redhat-release)
        if (( $(version_code $rhel_version) >= $(version_code 9.3.0) )); then
-               always_except EX-8739 6 7a 7b 23    # PCC-RW
+               always_except EX-8739 6 7a 7b 23 35 # PCC-RW
                always_except LU-17289 102          # fio io_uring
                always_except LU-17781 33           # inconsistent LSOM
        fi
@@ -768,7 +768,7 @@ test_4() {
        local loopfile="$TMP/$tfile"
        local mntpt="/mnt/pcc.$tdir"
        local hsm_root="$mntpt/$tdir"
-       local excepts="-e 6 -e 7 -e 8 -e 9"
+       local excepts="-e 7 -e 8 -e 9"
 
        ! is_project_quota_supported &&
                skip "project quota is not supported" && return
@@ -784,16 +784,14 @@ test_4() {
 
        # 1. mmap_sanity tst7 failed on the local ext4 filesystem.
        #    It seems that Lustre filesystem does special process for tst 7.
-       # 2. There is a mmap problem for PCC when multiple clients read/write
-       #    on a shared mmapped file for mmap_sanity tst 6.
-       # 3. Current CentOS8 kernel does not strictly obey POSIX syntax for
+       # 2. Current CentOS8 kernel does not strictly obey POSIX syntax for
        #    mmap() within the mapping but beyond current end of the underlying
        #    files: It does not send SIGBUS signals to the process.
-       # 4. For negative file offset, sanity_mmap also failed on 48 bits
+       # 3. For negative file offset, sanity_mmap also failed on 48 bits
        #    ldiskfs backend due to too large offset: "Value too large for
        #    defined data type".
        # mmap_sanity tst7/tst8/tst9 all failed on Lustre and local ext4.
-       # Thus, we exclude sanity tst6/tst7/tst8/tst9 from the PCC testing.
+       # Thus, we exclude sanity tst7/tst8/tst9 from the PCC testing.
        $LUSTRE/tests/mmap_sanity -d $DIR/$tdir -m $DIR2/$tdir $excepts ||
                error "mmap_sanity test failed"
        sync; sleep 1; sync
@@ -2324,7 +2322,7 @@ test_25() {
                error "failed to fall back to Lustre I/O path for mmap-read"
        # Above mmap read will return VM_FAULT_SIGBUS failure and
        # retry the IO on normal IO path.
-       check_lpcc_state $file "readonly"
+       check_lpcc_state $file "none"
        check_file_data $SINGLEAGT $file "ro_fake_mmap_cat_err"
 
        do_facet $SINGLEAGT $LFS pcc detach $file ||
@@ -2910,6 +2908,39 @@ test_34() {
 }
 run_test 34 "Cache rule with comparator (>, <) for Project ID range"
 
+test_35() {
+       local loopfile="$TMP/$tfile"
+       local mntpt="/mnt/pcc.$tdir"
+       local hsm_root="$mntpt/$tdir"
+       local file=$DIR/$tfile
+       local -a lpcc_path
+
+       setup_loopdev $SINGLEAGT $loopfile $mntpt 50
+       copytool setup -m "$MOUNT" -a "$HSM_ARCHIVE_NUMBER"
+       setup_pcc_mapping
+
+       echo "pccro_mmap_data" > $file
+       lpcc_path=$(lpcc_fid2path $hsm_root $file)
+       do_facet $SINGLEAGT $LFS pcc attach -r -i $HSM_ARCHIVE_NUMBER $file ||
+               error "failed to PCC-RO attach file $file"
+       check_lpcc_state $file "readonly"
+       check_lpcc_data $SINGLEAGT $lpcc_path $file "pccro_mmap_data"
+
+       local content=$(do_facet $SINGLEAGT $MMAP_CAT $file)
+
+       [[ $content == "pccro_mmap_data" ]] ||
+               error "mmap_cat data mismatch: $content"
+       check_lpcc_state $file "readonly"
+
+       do_facet $SINGLEAGT $LFS pcc detach $file ||
+               error "failed to PCC-RO detach $file"
+       content=$(do_facet $SINGLEAGT $MMAP_CAT $file)
+       [[ $content == "pccro_mmap_data" ]] ||
+               error "mmap_cat data mismatch: $content"
+       check_lpcc_state $file "none"
+}
+run_test 35 "mmap fault test"
+
 test_36_base() {
        local loopfile="$TMP/$tfile"
        local mntpt="/mnt/pcc.$tdir"