#LyX 1.3 created this file. For more info see http://www.lyx.org/ \lyxformat 221 \textclass article \language english \inputencoding auto \fontscheme times \graphics default \paperfontsize 12 \spacing single \papersize Default \paperpackage a4 \use_geometry 0 \use_amsmath 0 \use_natbib 0 \use_numerical_citations 0 \paperorientation portrait \secnumdepth 3 \tocdepth 3 \paragraph_separation skip \defskip medskip \quotes_language english \quotes_times 2 \papercolumns 1 \papersides 1 \paperpagestyle default \layout Title High Level Design of Remote UID/GID Handling \layout Author Peter Braam, Eric Mei \layout Date Jan 27, 2005 \layout Section From the ERS (Engineering Requirements Spec, formerly Architecture) \layout Itemize Perform uid/gid translation between remote clients and the local user database. \layout Itemize Handle client programs that call the setuid/setgid/setgroups syscalls to gain unusual privileges. \layout Itemize Handle supplementary group membership. \layout Itemize Support various security policies in situations with or without strong authentication such as Kerberos V5. \layout Paragraph NOTE: \layout Itemize Remote clients may have a different user database from that of the MDS's. \layout Itemize Remote ACL issues are addressed by a separate module. \layout Itemize Most of the content of this document has already been described in the Lustre Book. \layout Standard The architecture prescribes a translation mechanism at the MDS: the MDS will translate a locally found uid/gid, which is obtained through the Kerberos principal. \layout Section Functional Specification \layout Subsection Determine local/remote clients \layout Itemize A \begin_inset Quotes eld \end_inset local \begin_inset Quotes erd \end_inset client is a client node which is expected to share the same user database with the MDS's. \layout Itemize A \begin_inset Quotes eld \end_inset remote \begin_inset Quotes erd \end_inset client is a client node which is expected to have a different user database from the MDS's.
\layout Standard The MDS's will be able to determine whether a client node is local or remote when the client first connects to the MDS, and will send its decision back to the client. From then on, both the MDS and the client make different operational decisions according to this flag. The remote flag is per client, not per user. Once the MDS has made the decision, it remains unchanged until the client leaves the cluster membership (for example, by umount). \layout Standard The MDS will do many conversions (mostly uid/gid mapping) for users on remote clients because of the user database mismatch, and due to the nature of this mismatch we have to place some limitations on users of remote clients compared to local clients. The following sections describe the details. \layout Subsection Mapping uid/gid from clients \layout Standard For a local client, we obviously don't need to do any uid/gid mapping. For remote clients, we need to translate the uid/gid in each request into one which lives in the local user database, and vice versa: translate the uid/gid in each reply into the one in the remote user database. This translation affects the uid/gid's found in the inode as owner/group, the security context which describes under what uid the MDS is executing, and in some cases (chown is a good example) the arguments of calls. \layout Standard Each MDS will have to access a uid-mapping database, which prescribes which principal from which nid/netid should be mapped to which local uid. The mapping database must be the same on every MDS to get consistent results. At runtime, when a remote user authenticates with the MDS, the corresponding mapping entry is read from the on-disk database and cached in the kernel via an upcall. Note that the same principal from different clients might be mapped to different local users, according to the mapping database. So on each MDS there is a per-client structure which maintains the uid mapping cache. \layout Standard Each remote client must have nllu/nllg installed.
'nllu' stands for \begin_inset Quotes eld \end_inset Non Local Lustre User \begin_inset Quotes erd \end_inset , while 'nllg' stands for \begin_inset Quotes eld \end_inset Non Local Lustre Group \begin_inset Quotes erd \end_inset . When a client first mounts a lustre fileset, it should notify the MDS which local uid/gid act as nllu/nllg. The MDS will translate unrecognized uid/gid's to these before sending the reply to the client. Thus from the client's point of view, files which belong to unauthorized users are shown as belonging to nllu/nllg. \layout Subsection Lustre security description (LSD) \layout Standard There is a security configuration database on each MDS, which describes who (uid) from where (nid/netid) has permission to setuid/setgid/setgroups. Later we might add more to it. The database must be the same on every MDS to get consistent results. \layout Standard LSD refers to the in-kernel data structure which describes a user's security properties on the MDS. It is roughly defined as: \layout LyX-Code struct lustre_sec_desc { \layout LyX-Code uid_t uid; \layout LyX-Code gid_t gid; \layout LyX-Code supp_grp_t supp_grp; \layout LyX-Code setxid_desc setxid; \layout LyX-Code /* more security tags added here */ \layout LyX-Code }; \layout Standard In the future we'll add more special security tags to it. Each LSD entry corresponds to a user in the local user database. The 'setxid_desc' must be able to describe the setuid/setgid/setgroups permissions for each client separately. \layout Standard The LSD cache is populated via an upcall at runtime. The user-level helper is fed the uid as a parameter, finds this uid's principal gid and supplementary groups in the local user database, and finds the setxid permission bits and other security tags in the on-disk security database. \layout Standard Each LSD entry has a limited expiration time, and will be flushed out when it expires.
The next request from this user will cause the LSD to be populated again, with up-to-date security settings if they have changed. The system administrator can also choose to forcibly flush a certain user's LSD. \layout Standard Every filesystem access request from a client needs to go through LSD checking. This checking is uid based: for requests coming from a remote client, the uid is first mapped as described above, and then checked against the LSD. \layout Subsection The MDS security context \layout Standard All kernel-level service threads on the MDS run as root, waiting for requests from other nodes and providing services. But for requests that access the filesystem on behalf of a certain user, those threads must act as that user and run with its identity. Thus when such a request comes in, we first collect the identity information for this user as described above, including uid, gid, etc., then switch the identity in the process context before actually executing the filesystem operation; we also need to switch the root directory of the process to the root of the MDS's backend filesystem. After the operation finishes, we switch back to the original context and prepare for the next service. \layout Standard Some requests for special services, such as llog handling and special interaction between MDSs, don't represent any particular user and require keeping root privilege. In those situations we need neither the context switch nor the user identity preparation. \layout Subsection Remote client cache flushing \layout Standard A remote client should realize that the locally cached owner information of files, e.g. owner and group, has been translated by the server side, and that some mappings might become stale as time goes on. For example: a user has newly authenticated, while some cached file which should be owned by him still shows its owner as \begin_inset Quotes eld \end_inset nllu \begin_inset Quotes erd \end_inset . The client must choose the proper time to flush such stale owner information, to give the user a consistent view.
All attribute locks held by clients must be given a revocation callback when a new user connects. \layout Section Use Cases \layout Subsection Connect rpc from local realm (case 1) \layout Enumerate Alice runs 'mount'; \layout Enumerate Alice sends the first ptlrpc request (MDS_CONNECT) without GSS security to the MDS; \layout Enumerate mds_handle() initializes the per-client structure and clears the remote flag in it; \layout Enumerate After the connection succeeds, the MDS sends the remote flag back to the client for future use on the client side. \layout Subsection Connect rpc from local realm (case 2) \layout Enumerate Alice runs 'mount'; \layout Enumerate Alice, from an MDS local realm, sends the first ptlrpc request (MDS_CONNECT) with GSS security to the MDS; \layout Enumerate The MDS svcgssd determines that it is from a local realm client; \layout Enumerate mds_handle() initializes the per-client structure and clears the remote flag in it; \layout Enumerate After the connection succeeds, the MDS sends the remote flag back to the client for future use on the client side. \layout Subsection Connect rpc from remote realm \layout Enumerate Alice, from an MDS remote realm, sends the first ptlrpc request (MDS_CONNECT) with GSS security to the MDS, along with its nllu/nllg id numbers; \layout Enumerate The MDS svcgssd determines that it is from a remote realm client; \layout Enumerate The mds_handle() logic initializes the per-client structure: \begin_deeper \layout Enumerate Set the remote flag in it; \layout Enumerate Fill in the nllu/nllg ids obtained from the client rpc request; \end_deeper \layout Enumerate After the connection succeeds, the MDS sends the remote flag back to the client for future use on the client side.
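\layout Standard The three connect cases above can be summarized in a small sketch. This is only an illustration under stated assumptions, not the actual mds_handle() code: the names connect_sec, client_state and client_connect_init are hypothetical, and the sketch assumes that the GSS layer has already classified the request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: how the GSS layer classified the MDS_CONNECT request. */
enum connect_sec {
    SEC_NONE,             /* no GSS security (local realm, case 1) */
    SEC_GSS_LOCAL_REALM,  /* GSS, local Kerberos realm (local realm, case 2) */
    SEC_GSS_REMOTE_REALM  /* GSS, remote Kerberos realm */
};

/* Hypothetical per-client state kept on the MDS. */
struct client_state {
    bool     remote;  /* the per-client remote flag */
    uint32_t nllu;    /* client's "Non Local Lustre User" id, remote clients only */
    uint32_t nllg;    /* client's "Non Local Lustre Group" id, remote clients only */
};

/*
 * Initialize the per-client structure at connect time and return the
 * remote flag that is sent back to the client in the reply.
 */
static bool client_connect_init(struct client_state *cs, enum connect_sec sec,
                                uint32_t nllu, uint32_t nllg)
{
    assert(cs != NULL);

    cs->remote = (sec == SEC_GSS_REMOTE_REALM);
    /* nllu/nllg are only recorded for remote clients. */
    cs->nllu = cs->remote ? nllu : 0;
    cs->nllg = cs->remote ? nllg : 0;
    return cs->remote;
}
```

\layout Standard The flag is computed once per connection and never per user, which matches the per-client nature of the remote flag described earlier.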
\layout Subsection Filesystem access request \layout Enumerate Alice (from a local or remote client) tries to access a file in lustre. \layout Enumerate If Alice is from a remote client, the MDS does uid/gid mapping; otherwise it does nothing. \layout Enumerate The MDS obtains the LSD item for Alice. \layout Enumerate The MDS performs the permission check, based on LSD policies. \layout Enumerate The MDS service process switches to this user's context. \layout Enumerate The MDS finishes the file operation on behalf of Alice. \layout Enumerate The MDS service process switches back to the original context. \layout Enumerate If Alice is from a remote client, the MDS does uid/gid reverse mapping if needed. \layout Enumerate The MDS sends the reply. \layout Subsection Rpc after setuid/setgid/setgroups from local clients \layout Enumerate Alice calls setuid/setgid/setgroups to change her identity to Bob on local client node X; \layout Enumerate Bob (Alice in fact) tries to access a lustre file which belongs to Bob; \layout Enumerate The MDS verifies the permission of Bob through the locally cached LSD configuration; \layout Enumerate The MDS turns down or accepts the file access request. \layout Subsection Rpc after setuid/setgid/setgroups from remote clients \layout Enumerate Alice calls setuid/setgid/setgroups to change her identity to Bob on remote client node Y; \layout Enumerate Bob (Alice in fact) tries to access a lustre file which belongs to Bob; \layout Enumerate The MDS finds that Bob is from the remote realm and is in fact not the real Bob; \layout Enumerate The MDS turns down the file access request. \layout Subsection Update LSD configuration in MDS \layout Enumerate The lustre system administrator wants to update the current LSD options; \layout Enumerate The sysadmin uses the lsd update utility, which updates the on-disk security database and notifies the MDS of the changes to the LSD configuration; \layout Enumerate The MDS refreshes the cached LSD info through an upcall.
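\layout Standard The flush-and-refresh behaviour in the last use case can be sketched as a tiny cache entry life cycle. This is a minimal sketch, not the real upcall-cache code: the state names follow the NEW/READY/INVALID states listed under State Management, while lsd_entry, lsd_upcall_done, lsd_flush and lsd_usable are hypothetical names, and time is passed in explicitly to keep the example deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

enum lsd_state { LSD_NEW, LSD_READY, LSD_INVALID };

/* Hypothetical cached LSD entry; the security payload itself is omitted. */
struct lsd_entry {
    unsigned int   uid;
    enum lsd_state state;
    time_t         expires;  /* only meaningful while state == LSD_READY */
};

/* Called when the upcall for a NEW entry completes (ok) or fails/times out. */
static void lsd_upcall_done(struct lsd_entry *e, bool ok, time_t now, time_t ttl)
{
    assert(e != NULL && e->state == LSD_NEW);
    if (ok) {
        e->state = LSD_READY;
        e->expires = now + ttl;
    } else {
        e->state = LSD_INVALID;
    }
}

/* Forced flush by the sysadmin, e.g. after updating the on-disk database. */
static void lsd_flush(struct lsd_entry *e)
{
    assert(e != NULL);
    e->state = LSD_INVALID;
}

/* A request may use the entry only while it is READY and not yet expired. */
static bool lsd_usable(struct lsd_entry *e, time_t now)
{
    assert(e != NULL);
    if (e->state == LSD_READY && now >= e->expires)
        e->state = LSD_INVALID;
    return e->state == LSD_READY;
}
```

\layout Standard When lsd_usable() returns false, the caller would create a fresh NEW entry and issue the upcall again, which is how updated settings reach the kernel in this sketch.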
\layout Subsection Revoke a local user \layout Enumerate Bob is able to access the lustre filesystem. \layout Enumerate The sysadmin removes Bob from the MDS's local user database, and flushes the in-kernel LSD cache for Bob. \layout Enumerate Bob immediately loses access to the MDS. \layout Subsection Revoke a remote user \layout Enumerate Alice on a remote client is mapped to MDS local user Bob. \layout Enumerate Alice is able to access the lustre filesystem. \layout Enumerate The sysadmin removes the mapping \begin_inset Quotes eld \end_inset Alice->Bob \begin_inset Quotes erd \end_inset from the mapping database, and flushes the in-kernel mapping entry. \layout Enumerate Alice immediately loses access to the MDS. \layout Enumerate If the mapping \begin_inset Quotes eld \end_inset anyone else -> Carol \begin_inset Quotes erd \end_inset exists in the mapping database, Alice can reconnect to the MDS and will then be mapped to Carol. \layout Subsection Revoke a remote user (2) \layout Enumerate Alice on a remote client is mapped to MDS local user Bob. \layout Enumerate Alice is able to access the lustre filesystem. \layout Enumerate The sysadmin removes Bob from the MDS's local user database, and flushes the in-kernel LSD cache for Bob. \layout Enumerate Alice immediately loses access to the MDS. \layout Enumerate If the mapping \begin_inset Quotes eld \end_inset anyone else -> Carol \begin_inset Quotes erd \end_inset exists in the mapping database, Alice can reconnect to the MDS and will then be mapped to Carol. \layout Subsection 'ls -l' on remote client \layout Enumerate Suppose that on a remote client, Alice's principal group is AliceGrp and Bob's principal group is BobGrp.
\layout Enumerate There are several files on lustre: file_1 belongs to Alice:AliceGrp; file_2 belongs to Alice:BobGrp; file_3 belongs to Bob:AliceGrp; file_4 belongs to Bob:BobGrp; file_5 belongs to Bob:nllg; \layout Enumerate Alice runs 'ls -l'; the output looks like this: file_1 belongs to Alice:AliceGrp; file_2 belongs to Alice:nllg; file_3 belongs to nllu:AliceGrp; file_4 belongs to nllu:nllg; file_5 belongs to nllu:nllg; \layout Enumerate Bob has just logged in to the client system and also runs 'ls -l'; the output looks like this: file_1 belongs to Alice:AliceGrp; file_2 belongs to Alice:BobGrp; file_3 belongs to Bob:AliceGrp; file_4 belongs to Bob:BobGrp; file_5 belongs to Bob:nllg; \layout Enumerate Alice runs 'ls -l' again; the output is the same as Bob's list. \layout Enumerate Alice logs out, then Bob runs 'ls -l' again; the output looks like this: file_1 belongs to nllu:nllg; file_2 belongs to nllu:BobGrp; file_3 belongs to Bob:nllg; file_4 belongs to Bob:BobGrp; file_5 belongs to Bob:nllg; \layout Subsection Chown on remote client \layout Enumerate The root user on a remote client wants to change the owner of a file to Bob, while Bob has not yet logged in (authenticated with lustre). \layout Enumerate The MDS can't find the mapping for the destination uid, so it returns an error. \layout Enumerate Bob logs in at that time. \layout Enumerate Root does the same chown again. \layout Enumerate The MDS grants the request, no matter what the original owner of this file is. \layout Subsection Chgrp on remote client \layout Enumerate Traditional chgrp on a remote client is not allowed, since there is no clear group id mapping between the local and remote databases, so the group id on the remote client is not meaningful on the MDS. \layout Section Logic Specification \layout Subsection Specify nllu/nllg \layout Standard When a client mounts, in addition to the other parameters, the user needs to supply the IDs of nllu/nllg on this client, which are sent to the MDS at connect time. If no nllu/nllg is explicitly supplied, default values are used.
\layout Subsection Determine local or remote client \layout Standard Under GSS protection, the user can explicitly supply the remote flag at mount time. The MDS makes its decision in the following order: \layout Itemize All permitted connections without GSS security are from local realm clients. \layout Itemize For connections with GSS security, if the user supplied the remote flag during mount, the MDS grants the flag as requested. \layout Itemize All connections with GSS/local_realm_kerberos are from local realm clients. \layout Itemize All connections with GSS/remote_realm_kerberos are from remote realm clients. \layout Standard Here we make the assumption that Kerberos's local/remote realm == lustre's local/remote realm. Later we might bring more factors into this decision making. \layout Standard The GSS/Kerberos module is responsible for reporting whether the initial connect request has strong security and whether it comes from a remote Kerberos realm. \layout Standard On the MDS's, the per-client export structure has a flag to indicate whether this client is local or remote. Accordingly, each client has a similar flag, which is sent back by the MDS's after the initial connection. \layout Subsection Handle local rpc request \layout Standard For each filesystem access request from a client, we first get the LSD for this uid. We look it up in the cache and, if it is not found or already invalid, issue an upcall to get it. If we finally fail to get the LSD (timeout or error), we simply deny the request. \layout Standard After obtaining the LSD, we also check whether the client intends to do setuid/setgid/setgroups. If so, we check the permission bits in the LSD; if they do not allow it, we also deny the request. The intention of setuid/setgid can be detected by comparing the uid, gid, fsuid, fsgid sent by the client with the locally authorized uid/gid.
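\layout Standard The setuid/setgid intent check described above can be sketched as follows. This is a hedged illustration only: the req_ids structure, the LSD_PERM_* bit names and the function names are assumptions made for the example, not the real request layout or the real setxid_desc encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission bits, standing in for the setxid part of an LSD. */
#define LSD_PERM_SETUID 0x1u
#define LSD_PERM_SETGID 0x2u

/* Hypothetical identity fields carried in a client request. */
struct req_ids {
    uint32_t uid, gid, fsuid, fsgid;
};

/* The client intends setuid if any uid it sent differs from the authorized uid. */
static bool wants_setuid(const struct req_ids *r, uint32_t auth_uid)
{
    assert(r != NULL);
    return r->uid != auth_uid || r->fsuid != auth_uid;
}

/* Likewise for setgid, compared against the authorized principal gid. */
static bool wants_setgid(const struct req_ids *r, uint32_t auth_gid)
{
    assert(r != NULL);
    return r->gid != auth_gid || r->fsgid != auth_gid;
}

/* Deny the request if an intent is detected without the matching permission bit. */
static bool setxid_allowed(const struct req_ids *r, uint32_t auth_uid,
                           uint32_t auth_gid, unsigned int perm)
{
    if (wants_setuid(r, auth_uid) && !(perm & LSD_PERM_SETUID))
        return false;
    if (wants_setgid(r, auth_gid) && !(perm & LSD_PERM_SETGID))
        return false;
    return true;
}
```

\layout Standard A request whose ids all match the authorized uid/gid passes regardless of the permission bits, so ordinary requests are unaffected by the setxid policy.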
\layout Standard If setgroups is permitted: for root we directly use the supplementary groups array sent by the client; for a normal user we compare the groups sent by the client with those in the LSD, guaranteeing that the client can only reduce the array (it can't add new ids which are not part of the group array in the LSD). \layout Standard If setgroups is not permitted, we simply use the supplementary group array provided by the LSD. \layout Standard After the whole security context has been prepared as above, we switch it into the process context and perform the actual filesystem operation. After it finishes, we switch back to the original context and send the reply to the client. \layout Standard Later a special security policy will be needed to allow RAW access by FID without a capability. This is used for analyzing audit logs, finding pathnames from fids (for recovery), etc. \layout Subsection Remote user mapping database \layout Standard There will be a user mapping configuration file on the MDS, already defined in the \begin_inset Quotes eld \end_inset functional specification \begin_inset Quotes erd \end_inset . The MDS kernel will also maintain a cache of this mapping information. It is populated by an upcall to the server-side gss daemon, along with the gss credential information. \layout Itemize The on-disk mapping database only describes how a user (principal) is mapped to a local uid, and doesn't need to specify the gid mapping. \layout Itemize Both the on-disk mapping database and the kernel mapping cache should allow mapping all other remote users to a certain local user. \layout Itemize On the MDS, the per-client structure maintains this mapping cache. When a user from a remote client gets authenticated, we check the on-disk mapping database. If no mapping item is found for this user, we deny the user; otherwise we record the target uid. \layout Itemize When a fs access request comes from a remote client, it contains the user's uid and gid on the remote client. Here we can establish the mapping between the uid and the target uid.
With the target uid we can find the target gid from the local user database (from the LSD), thus we can also establish the mapping between the gid and the target gid. \layout Itemize With the mapping established above, we now do the mapping: replace the uid/gid in the rpc request with the target uid/gid. If the request is a chown we also check and map the new owner id. \layout Itemize When the reply is populated and about to be sent back, we again check the mapping cache, and do the reverse mapping in cases where file attributes are returned to clients. Ids for which no matching item is found are mapped to the nllu/nllg of this remote client. \layout Subsection Handle remote rpc request \layout Standard The overall process of handling a remote rpc request is the same as for a local user, except for the following: \layout Itemize For an incoming request, first do the uid/gid mapping for the requestor, and do the reverse mapping for the reply, as described above. \layout Itemize No setuid/setgid/setgroups intention is permitted, unless we explicitly allow setuid-root in the setxid database. So we ignore the supplementary groups sent by the client (if any), and simply use the ones provided by the LSD. \layout Itemize For a chown request, we also translate the new owner id (as already described above) according to the in-kernel mapping cache. This means the root user on a remote client can't change the owner of a file to a user who has not logged in yet. \layout Itemize Deny all chgrp requests, since the group on a remote client has no clear mapping in the MDS's local user database. (We could also choose to allow this when the new group id shows up in the in-kernel mapping cache, but it doesn't seem to make much sense.) So we probably need a special tool like \begin_inset Quotes eld \end_inset lfs chgrp \begin_inset Quotes erd \end_inset to perform chgrp on a remote client, which would send out the text name instead of translating it to an id locally.
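\layout Standard The request and reply mapping described in the two lists above can be sketched for uids as follows; gids would be handled the same way with nllg as the fallback. This is a minimal sketch under stated assumptions: the uid_map_entry and client_uid_map structures and both function names are hypothetical, and a linear array stands in for the in-kernel mapping cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached mapping of one authenticated remote user. */
struct uid_map_entry {
    uint32_t remote_uid;  /* uid on the remote client */
    uint32_t local_uid;   /* target uid in the MDS local user database */
};

/* Hypothetical per-client mapping cache. */
struct client_uid_map {
    const struct uid_map_entry *entries;
    size_t                      count;
    uint32_t                    nllu;  /* supplied by the client at connect time */
};

/*
 * Request direction: translate a remote uid to the local target uid.
 * Returns false when no mapping exists, in which case the request
 * (or the chown to that new owner) is refused.
 */
static bool map_request_uid(const struct client_uid_map *m, uint32_t remote_uid,
                            uint32_t *local_uid)
{
    assert(m != NULL && local_uid != NULL);
    for (size_t i = 0; i < m->count; i++) {
        if (m->entries[i].remote_uid == remote_uid) {
            *local_uid = m->entries[i].local_uid;
            return true;
        }
    }
    return false;
}

/*
 * Reply direction: translate a local uid back to the remote uid.
 * Ids without a matching entry are shown to the client as nllu.
 */
static uint32_t map_reply_uid(const struct client_uid_map *m, uint32_t local_uid)
{
    assert(m != NULL);
    for (size_t i = 0; i < m->count; i++) {
        if (m->entries[i].local_uid == local_uid)
            return m->entries[i].remote_uid;
    }
    return m->nllu;
}
```

\layout Standard The asymmetry is deliberate in this sketch: an unmapped id in a request is an error, while an unmapped id in a reply is only hidden behind nllu.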
\layout Subsection Remote client cache flushing \layout Standard At any time there might be cached inodes whose owner is shown as nllu/nllg. If a new user Alice gets authenticated and she happens to be the owner of those inodes, we need to refresh those inodes even if their cache status is correct, otherwise Alice will find that her files belong to others. Since we don't know whether an inode with nllu/nllg belongs to Alice or not, we must flush all of them. \layout Standard On the MDS, a callback or similar event notification mechanism should be hooked into the gss module. When a user authenticates for the first time, we should iterate through all the granted locks corresponding to this client, and revoke them selectively. Strictly speaking we only want to revoke those inodebits locks for which the owner/group of the resource (inode) does not show up in the in-kernel mapping database, but here we just flush all the inodebits locks, since the cache is quickly repopulated: there are at most 20-100 cached locks on clients at the moment. \layout Standard When Alice logs out of the client system, we do a similar thing: iterate through all the granted locks corresponding to this client, and revoke them selectively. Here we want to revoke those inodebits locks for which the owner/group of the resource (inode) is Alice. We could also choose to flush all of them as in the case above. \layout Subsection LSD upcall \layout Standard There is general upcall-cache code which does an upcall into user space, caches the data passed down into the kernel, and also implements timeout invalidation. The kernel LSD can simply be implemented as an instance of it, so it will be quite simple. \layout Standard A user-space tool should provide the following functionality: \layout Itemize Accept the uid as a parameter. \layout Itemize Obtain the gid and the supplementary group id array which the uid belongs to; on failure just return an error. \layout Itemize Obtain the setxid permission bits for this user on this NID from the database.
If none are found, a default bitset is applied: (1) for a local client: setuid/setgid is off, setgroups for root is off, setgroups for a normal user is on; (2) for a remote client: all of setuid/setgid/setgroups are off. \layout Itemize Pass all the collected information back to the kernel via /proc. \layout Standard Since upcalls can happen concurrently, and the admin can modify the database at any time, some kind of read-write locking is needed on the database file. \layout Subsection Recovery consideration \layout Standard All the code here should have minimal effect on recovery. After an MDS crash, the security context will be established at connection time during recovery; the uid-mapping cache and the LSD are actually \begin_inset Quotes eld \end_inset adaptive \begin_inset Quotes erd \end_inset , so they will also be repopulated when handling the related user's replay requests during or after recovery. \layout Section State Management \layout Subsection Configuration states \layout Itemize The client has a remote flag at mount time. \layout Itemize Remote clients must have nllu:nllg installed. It could simply be nobody:nobody. \layout Itemize The MDS could have a remote-user mapping database which records which principal at which client should be mapped to which local user. Without the database no remote client is allowed to connect. \layout Itemize The MDS could have a security database which contains setxid permissions along with other security settings for each affected user. Without such a database a default setting is applied. \layout Subsection LSD entry state transitions \layout Enumerate NEW: generated and submitted to the upcall \layout Enumerate READY: ready to serve \layout Enumerate INVALID: expired or error \layout Standard The requestor initiates a NEW LSD entry; after the upcall successfully fills in the data it changes to READY; if a timeout or some error happens (e.g.
not found in the user database) during the upcall it changes to INVALID; a READY LSD changes to INVALID when it expires, is forcibly flushed by the sysadmin, or the MDS shuts down; an INVALID LSD is destroyed soon afterwards. \layout Standard No disk format changes. When a large number of users access lustre from all kinds of local/remote clients at the same time, the MDS will have more CPU and memory overhead, especially for remote users. No special recovery consideration. \layout Section Alternatives \layout Subsection NFSv4 \layout Standard NFSv4 sends users and groups by name. \layout Section Focus of Inspection \layout Itemize Could this pass the HP acceptance test? \layout Itemize Is anything unreasonable? Any security holes? \layout Itemize Is everything recoverable from an MDS/client crash? \the_end