[rfc] 64-bit inode numbers

Rick Macklem rmacklem at uoguelph.ca
Sat Jun 25 14:04:22 UTC 2011


Benjamin Kaduk wrote:
> Hmm, several messages regarding AFS that I will try to address at
> once.
> 
> 
> On Fri, 24 Jun 2011, Kostik Belousov wrote:
> > On Thu, Jun 23, 2011 at 06:05:56PM -0400, Garance A Drosehn wrote:
> >> Consider the thread "Increasing the size of dev_t and ino_t" from
> >> freebsd-arch in 2002:
> >>
> >> http://docs.freebsd.org/mail/archive/2002/freebsd-arch/20020317.freebsd-arch.html
> >>
> >> In particular, this message by Robert Watson:
> >>
> >> http://docs.freebsd.org/cgi/getmsg.cgi?fetch=139853+0+archive/2002/freebsd-arch/20020317.freebsd-arch
> >>
> >> I just participated in an online conference for OpenAFS, and while
> >> it
> >> isn't exactly taking the world by storm, I keep thinking it would
> >> be
> >> useful if FreeBSD could map individual AFS volumes to unique dev_t
> >> identifiers. And given the way AFS is implemented (as a global
> >> filesystem
> >> with many cells all reachable at the same time), and given the way
> >> most
> >> sites deploy AFS (with thousands or tens-of-thousands of individual
> >> AFS
> >> volumes *per site*), that adds up to a lot of values for dev_t.
> >>
> >> The upcoming release of OpenAFS should include a working and pretty
> >> stable AFS client for FreeBSD, so having a larger dev_t would have
> >> a
> >> more immediate application than it did back in 2002.
> > Am I right that the issue is the uniqueness of the dev_t for each
> > AFS volume, as reported by stat(2) ?
> >
> > Shouldn't the AFS client synthesize the dev_t for each new volume
> > mounted ? It seems that the current 32bit dev_t would be enough,
> > since I do not expect to see hundreds of thousands of mounts
> > on a single system.
> 
> The current OpenAFS implementation only presents a single mountpoint,
> /afs, and does not really distinguish between different mounted
> volumes.
> This is not ideal, and we would like to be able to make each volume
> appear
> as a separate device if there's a good way to do so. The technical
> challenge of doing this while still only having a single mount method
> for
> AFS is not something that I've looked at, there being more pressing
> issues
> on my plate.
> 
With a single mount point in the client (struct mount *), if st_dev
remains the same throughout the mount point, then all st_ino values must
be unique (i.e. no duplicate ino# == 2 or similar), or fts(3) complains
about cycles in the tree and gives up. (This shows up when you do
"ls -lR".)
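
To illustrate why duplicate st_ino values under one st_dev hurt, here is
a minimal sketch of the kind of ancestor check a tree walker does. It is
not the fts(3) source; struct dir_id and looks_like_cycle() are names
made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Identity of a directory as a tree walker sees it via stat(2). */
struct dir_id {
	dev_t dev;	/* st_dev */
	ino_t ino;	/* st_ino */
};

/*
 * Return true if 'cur' has the same (st_dev, st_ino) pair as any
 * ancestor on the current path.  A walker treats that as a cycle.
 */
static bool
looks_like_cycle(const struct dir_id *ancestors, size_t depth,
    struct dir_id cur)
{
	assert(ancestors != NULL || depth == 0);
	for (size_t i = 0; i < depth; i++) {
		if (ancestors[i].dev == cur.dev &&
		    ancestors[i].ino == cur.ino)
			return (true);
	}
	return (false);
}
```

So if the root of a second volume also reports ino# 2 with the same
st_dev as the root of the mount, it matches an ancestor and looks like a
cycle, even though it is a different directory.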

On the other hand, if st_dev changes within the single client mount
point, then the d_ino value in the directory entry for the object where
it changes (I've heard this referred to as the "mounted on inode#") must
be different from the st_ino reported for that object via stat(2), or
getcwd() gets confused, if I recall correctly.
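
A rough sketch of how a userland getcwd-style lookup might search the
parent directory for the name of a child, which is where d_ino and
st_dev interact. This is loosely modeled on the classic approach and is
not the actual libc code; struct ent and name_in_parent() are
hypothetical, and the entries are held in memory rather than read with
readdir(3) and stat(2).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* One directory entry plus what stat(2) would report for it. */
struct ent {
	const char *name;
	ino_t d_ino;	/* from the dirent */
	dev_t st_dev;	/* from stat(2) on the entry */
	ino_t st_ino;	/* from stat(2) on the entry */
};

/*
 * Find the name of the child (child_dev, child_ino) among the entries
 * of its parent directory, or return NULL if nothing matches.
 */
static const char *
name_in_parent(const struct ent *ents, size_t n, dev_t parent_dev,
    dev_t child_dev, ino_t child_ino)
{
	assert(ents != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (parent_dev == child_dev) {
			/* Same device: trust d_ino, no stat per entry. */
			if (ents[i].d_ino == child_ino)
				return (ents[i].name);
		} else if (ents[i].st_dev == child_dev &&
		    ents[i].st_ino == child_ino) {
			/* Device changed: compare stat results instead. */
			return (ents[i].name);
		}
	}
	return (NULL);
}
```

The point of the sketch is only that which field gets compared depends
on whether st_dev changed, so d_ino, st_ino and st_dev have to be
reported consistently with each other.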

> >
> > Please note that we do not guarantee dev_t stability across reboots
> > even
> > for real devices.
> 
> Hmm, this is somewhat annoying, as the AFS global namespace does
> provide a
> stable unique identifier for files/directories using a unique cell ID,
> volume ID, per-file ID, and uniquifier. Being able to directly use the
> cell/volume information for a dev_t would be quite convenient.
> 
> 
> 
> 
> 
> On Fri, 24 Jun 2011, Bruce Evans wrote:
> >
> > mnt_stat.f_fsid is generated from the dev_t, and tries to give
> > stability
> > across reboots. Otherwise, IIRC, nfs mounts break if the server is
> > rebooted. Not only the dev_t part, but other things in f_fsid,
> > depend
> > on the order of initialization, but the ids usually end up the same
> > if
> > you don't reconfigure much on the server.
> >
> > f_fsid also has a problem with uniqueness, but that is mainly because
> > it
> > wants to be unique when truncated to a 16-bit dev_t. dev_t is only
> > 16
> > bits in some versions of Linux, including in the FreeBSD i386 Linux
> > emulator (I can see traces of 32-bit dev_t in Linux-2.6.10 but not
> > in
> > the FreeBSD emulator).
> >
> > I hope AFS ids could be implemented like fsids and not need to
> > literally
> > match foreign ids, but if they are synthesized then they might be
> > harder
> > than fsids to keep invariant across reboots.
> 
> I'm not sure how one would have a chance of keeping things invariant
> across reboots other than to use the cell/volume IDs in some fashion.
> That said, the AFS client maintains its own copy of these unique IDs
> in
> the fs-specific vnode area, and should be able to talk to the server
> just
> fine if the fsids end up faked. Keeping the fake fsids consistent if a
> file goes in and out of the local cache may be a different issue,
> though.
> 
> 
> 
> 
> 
> On Fri, 24 Jun 2011, Rick Macklem wrote:
> 
> > Garance A Drosehn wrote:
> >> The AFS cell at RPI has approximately 40,000 AFS volumes, and each
> >> volume should have its own dev_t (IMO). That's just counting the
> >> collection of AFS volumes which are on RPI file servers, and any
> >> user sitting on one computer could access AFS volumes which are
> >> made available by other sites (aka "AFS cells"). Most RPI users
> >> would only have access to maybe 1/4 of those volumes which exist
> >> at RPI, but we do know that individual users have run 'find' over
> >> the entire RPI cell looking for whatever they're looking for. I
> >> once did a run of 'md5deep' on the entire RPI cell, thanks to a
> >> symlink which I didn't realize was in my home directory!
> 
> We have almost 50,000 volumes in the athena cell, here.
> 
> >>
> > Note that it is the value in mnt_stat.f_fsid that needs to be unique
> > w.r.t
> > other mount points in the machine. If AFS appears to be one mount
> > point in the FreeBSD client, then the only issue I know of is how
> > the client is expected to handle changes in dev_t within the mount
> 
> Er, how is the client expected to communicate these changes? As
> mentioned
> above, I believe we currently present only a single device and
> mountpoint,
> which is suboptimal. (Actually, it looks like we don't even initialize
> mnt_stat.f_fsid at all if I'm reading the current code correctly.
> Oops.)
> I would love to be able to present volume mountpoints as actually
> being
> mountpoints.
> 
> > point. fts(3) and friends will assume that it is a mount point
> > crossing when st_dev changes. They will then expect the funny
> > rule to hold that the d_ino in dirent is not the same as st_ino.
> >
> > What I do for NFSv4 is synthesize the mnt_stat.f_fsid value and
> > return that as st_dev for the mounted volume until I see the fsid
> > returned by the server change. Below that point, I return the fsid
> > from the server as st_dev so long as it isn't the same as the
> 
> I think I'm confused. You're ... walking a directory hierarchy, and
> return a fake st_dev value but hold onto the fsid value from the
> server,
> then when the fsid from the server changes (due to a ... different NFS
> mount?), start reporting that new fsid and throw away the fake st_dev
> value? Can you point me at the code that is doing this?
> 
> > synthesized one. That way, fts(3) and friends figure out the mount
> > point crossings within the server.
> >
> > "ls -lR" will usually find problems if this is broken.
> >> So one person can easily trigger the access of 10,000 AFS volumes
> >> on one computer using one command. That might sound terrifying if
> >> you imagine it as being 10,000 NFS mounts, but accessing AFS
> >> volumes
> >> isn't the same amount of work as auto-mounting NFS filesystems.
> >> So ignore whatever problems you might expect to see with 10,000
> >> filesystems mounted on one computer. Just realize that it is very
> >> easy for a single user to access tens of thousands of AFS volumes
> >> from one computer, and it would be "most correct" (programming
> >> wise)
> >> if all of those AFS volumes were to get a unique value for dev_t.
> >> And of course it's even easier for a remote-access system to access
> >> tens-of-thousands of AFS volumes, since it would have a few dozen
> >> users logged in at the same time.
> >>
> 
> 
> 
> I guess, at the end of the day, it's not clear to me what OpenAFS
> should
> do when we finally get around to exposing AFS volume mountpoints as
> device
> mountpoints to userland. Reusing existing globally-unique AFS ID
> information would be nice, but how to cleanly transform that to a
> smaller
> unique ID for the particular machine in question?
> 
> -Ben Kaduk
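
One possible way to get a smaller per-machine ID out of the cell/volume
pair is a simple assign-on-first-use table. This is only a sketch under
assumptions: the cell and volume are taken to be 32-bit numbers, and
MAX_VOLUMES, struct vol_table and vol_local_id() are made up for the
example. The IDs it hands out depend on access order, so they would not
be stable across reboots.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VOLUMES 1024	/* hypothetical table size for the sketch */

struct vol_key {
	uint32_t cell;		/* assumed numeric cell ID */
	uint32_t volume;	/* assumed numeric volume ID */
};

struct vol_table {
	struct vol_key keys[MAX_VOLUMES];
	size_t used;
};

/*
 * Return a small local ID (1..MAX_VOLUMES) for (cell, volume),
 * assigning the next free one the first time the pair is seen.
 * Returns 0 if the table is full.
 */
static uint32_t
vol_local_id(struct vol_table *t, uint32_t cell, uint32_t volume)
{
	assert(t != NULL);
	for (size_t i = 0; i < t->used; i++) {
		if (t->keys[i].cell == cell &&
		    t->keys[i].volume == volume)
			return ((uint32_t)i + 1);
	}
	if (t->used == MAX_VOLUMES)
		return (0);
	t->keys[t->used].cell = cell;
	t->keys[t->used].volume = volume;
	t->used++;
	return ((uint32_t)t->used);
}
```

A linear search is obviously too slow for tens of thousands of volumes;
a real version would want a hash table, but the uniqueness argument is
the same.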

