3/03/2014

03-03-14 - Windows Links

I wrote a deduper in Oodle today. I was considering making the default action be to replace duplicate files with a link to the original.

I wasn't sure whether to use "hard" or "soft" links, so I did a little research.

In Windows a "hard link" means that multiple file names all point at the same file data. It's a bit of a misnomer, it's not really a "link" to the original. There is no "original" or base file - all instances of a hard link are equal peers.
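To make the "equal peers" point concrete, here is a minimal sketch using Python's portable os.link rather than the Win32 call directly. The file names and the helper hard_link_peers are made up for illustration. It assumes a file system that supports hard links (NTFS does).

```python
import os
import tempfile

def hard_link_peers():
    """Create a file plus one hard link and report how the OS sees them."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "first.txt")    # hypothetical names
        second = os.path.join(d, "second.txt")
        with open(first, "w") as f:
            f.write("hello")
        os.link(first, second)                  # a second name for the same data
        same = os.path.samefile(first, second)  # one file, two names
        count = os.stat(first).st_nlink         # how many names the file has
        os.remove(first)                        # drop the name created first...
        with open(second) as f:
            survivor = f.read()                 # ...the data is still reachable
        return same, count, survivor
```

Removing the name that was created first does not disturb the other one, which is why neither name can be called the "original".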

A "soft link" is just an OS-level shortcut. There is an original base file, and the soft links point at it.

Both are ridiculously broken concepts and IMO should almost never be used.

With "hard links" the problem is that if you accidentally edit any of the links, you have edited the master data. If you did not intend that, you may have fucked up something severely.

Hard links are reasonable *if* the files linked are read-only (and somehow actually kept read-only, not just attrib'ed away).

The problem with "soft links" is that the links are not protected; if you rename or move or delete the original file, all the links are broken, and again you may have just severely fucked something up.
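A small sketch of that failure, again in Python for portability. The names are hypothetical, and on Windows os.symlink may need the symbolic link privilege or developer mode, so treat this as an illustration rather than something guaranteed to run for every user.

```python
import os
import tempfile

def dangling_after_rename():
    """Rename a symlink's target and report what happens to the link."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")   # hypothetical names
        link = os.path.join(d, "link.txt")
        renamed = os.path.join(d, "renamed.txt")
        with open(original, "w") as f:
            f.write("data")
        os.symlink(original, link)            # link stores the target *name*
        ok_before = os.path.exists(link)      # exists() follows the link
        os.rename(original, renamed)          # nothing warns about dependents
        ok_after = os.path.exists(link)       # target name is gone
        still_a_link = os.path.islink(link)   # the link entry itself remains
        return ok_before, ok_after, still_a_link
```

The rename succeeds silently, and the link is left behind pointing at a name that no longer exists.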

The big problem is that you get no warning in either case. Clearly what you want when you try to rename a file which has a bunch of soft links pointing at it is some kind of dialog that says "hey all these links point at this file, do you really want to rename it and break the links? or should I update the links to point at the new name?". Similarly with hard links, obviously what you want is some prompt like "hey if you modify this, do you want these hard links to see the new version or the old version?".

Now obviously you can't solve this problem in general without user prompts. But I believe that a refcounted copy-on-write link would have been a much more robust and safe solution. Open for write should have done a COW by default unless you passed a special flag to indicate you intend to edit shared data.
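No OS does this for you, but the idea can be approximated in user space. The sketch below is only that: open_for_write_cow and its shared flag are invented names, the link check and the copy are not atomic, and it relies on the link count reported by os.stat.

```python
import os
import shutil
import tempfile

def open_for_write_cow(path, mode="r+", shared=False):
    """Open path for writing. Unless shared=True, first detach this name
    from any other hard links by giving it a private copy of the data."""
    if not shared and os.stat(path).st_nlink > 1:
        dir_name = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=dir_name)  # temp file on the same volume
        os.close(fd)
        shutil.copy2(path, tmp)   # copy the data and basic metadata
        os.replace(tmp, path)     # this name now refers to the copy
    return open(path, mode)
```

With the default shared=False, a write through one name leaves the other names holding the old data. Passing shared=True is the "special flag" case where editing the shared data is intended.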

Even ignoring the fundamental logic of how links work, there are some very bad practical issues for links in Windows.

1. Soft links show a file size of 0 in the dir enumeration file info. This breaks the assumption that most apps make that the file size they get from the dir enumeration will be the same as the file size they get if they open that file handle and ask for its size. It can also screw up enumerations that are just trying to skip zero-size files.

Hard link file sizes are out of date. If the file data is modified, only the directory entry for the one that was used to modify the data is updated. All other links still have the old file sizes, mod times, etc.
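One defensive habit is to not trust the size from the directory enumeration for anything that might be a link, and to ask again after following it. A rough sketch in Python (the helper names are invented; the exact numbers the enumeration reports differ between platforms, so only the mismatch matters here):

```python
import os
import tempfile

def listed_vs_opened_sizes(directory):
    """Return {name: (size from the directory entry, size after following links)}."""
    sizes = {}
    with os.scandir(directory) as it:
        for entry in it:
            if not entry.is_file():   # follows links; broken links are skipped
                continue
            listed = entry.stat(follow_symlinks=False).st_size
            opened = os.stat(entry.path).st_size
            sizes[entry.name] = (listed, opened)
    return sizes

def demo_sizes():
    """Build a directory with one file and one symlink to it, then compare."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "data.bin")       # hypothetical names
        with open(target, "wb") as f:
            f.write(b"x" * 1000)
        os.symlink(target, os.path.join(d, "link.bin"))  # may need privilege on Windows
        return listed_vs_opened_sizes(d)
```

For the plain file both numbers agree. For the link they do not, which is exactly the assumption that breaks in apps that size their buffers or skip "empty" files based on the enumeration alone.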

2. Hard links break the assumption that saving to a file is the same as saving to a temp and then renaming onto the file. Many apps may or may not use the "write to temp then rename" pattern; what you get is massively different results in a very unexpected way.
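Here is a sketch of how the two save strategies diverge once a file has a second hard link. The function names are mine, and os.replace stands in for whatever rename call a given app actually uses.

```python
import os
import tempfile

def save_in_place(path, text):
    with open(path, "w") as f:       # truncates and rewrites the shared data
        f.write(text)

def save_via_temp(path, text):
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.replace(tmp, path)            # path now names a brand new file

def compare_save_strategies():
    """Save 'v2' through one name and report what the other name contains."""
    results = {}
    for name, save in (("in_place", save_in_place), ("via_temp", save_via_temp)):
        with tempfile.TemporaryDirectory() as d:
            a = os.path.join(d, "a.txt")   # hypothetical names
            b = os.path.join(d, "b.txt")
            with open(a, "w") as f:
                f.write("v1")
            os.link(a, b)
            save(a, "v2")
            with open(b) as f:
                results[name] = f.read()
    return results
```

After the in-place save the second name sees the new contents. After the temp-and-rename save it still sees the old contents, and the two names are no longer linked at all.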

3. Mod times are hosed. In general attributes are hosed; neither type of link reflects the attributes of the actual file data in the link - until they are opened, then they get updated. Mod times are particularly bad because many apps use them to detect changes, and with links the file data can be changed but the mod time won't reflect it.

Dear lord. So non-robust.

7 comments:

Sylvain V said...

Hi Charles,

Being used to the hard/soft links in Unix, I'd have a few comments from a different perspective:
- hard link editing creating a new copy: I'd actually expect editing one hard link to edit the content of all the hard links. If I use hard links to store one file in different classifications (ex: according to date and to use, for a picture or for an office doc) I'd want all versions to get updated at the same time.

- deleting original making soft links not work: that's unfortunately how soft links work on all O/S.

I'd add a few drawbacks to hard links in the pre-Win8 era (afaik Win7 still suffers from these):
- deleting one hard link will delete all links to the file, not only remove the link.
- moving the original file or directory will break all hard links.

Also on many Unix implementations:
- cannot create hard links across different disks/partitions - SunOS is the only Unix I know which supports that.

Unknown said...

I'm thankful that in order to cause messes with symbolic links you need SeCreateSymbolicLinkPrivilege rights, which means that most applications running as mortals don't get the chance to mess up.

Boost recently adopted hard/symbolic links for building their new modularised repositories. Turns out that Git does amusing things to/through linked directories/files.

Another fun thing is that linking tends to render things unusable across a CIFS share as they don't resolve properly on client machines.

Sly: You should try AFS some day. There you can't hardlink across directory boundaries, breaking all sorts of builds and software.

Fabian 'ryg' Giesen said...

Don't use links for dedup. Links are about having several names for the *same* object. Dedup is about sharing *equivalent* objects (same value, not necessarily equal).

Windows essentially copies the terms and loose semantics from Unix, but the details are broken (as usual).

Unix: the actual block describing the physical file (size, pointers to data blocks, mod times, permissions etc.) is an inode.

Directories store file names and inode numbers, and not much else (the stuff that "readdir" returns). Every name for a file = a hard link. Thus a file with multiple hard links is just a file with multiple names; as far as the OS is concerned, there is no "master copy" or anything like that. Because all the actual file metadata is stored in the inode, none of it goes out of sync no matter which name you address it by. (Hence also the "no hard links across FSes" limitation - the whole referring to things by inode number doesn't make sense across filesystems.)

So that's close to the Windows thing, but at least it has all the file-specific data in the inode so there's one copy; Windows having no inodes and putting everything into directory/MFT entries makes that suck even more.

Sym links/soft links are basically the same everywhere: *weak* references, and to a file *name* not a file.

In both Windows and Linux, most objects the kernel deals with are mapped (or at least mappable) into the file namespace, and the primary use of links is as a mechanism to introduce "synonyms" in that namespace. Their semantics don't work for dedup because that's not what they're trying to do.
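The "same object" versus "equivalent object" distinction can be checked directly. A minimal sketch in Python, where classify_pair and the file names are invented for illustration:

```python
import filecmp
import os
import tempfile

def classify_pair(a, b):
    """Return 'same object', 'equivalent', or 'different' for two paths."""
    if os.path.samefile(a, b):
        return "same object"      # two names for one file (a link)
    if filecmp.cmp(a, b, shallow=False):
        return "equivalent"       # separate files with identical bytes
    return "different"

def demo_classify():
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.bin")
        link = os.path.join(d, "link.bin")
        copy = os.path.join(d, "copy.bin")
        other = os.path.join(d, "other.bin")
        for path, data in ((a, b"same"), (copy, b"same"), (other, b"diff")):
            with open(path, "wb") as f:
                f.write(data)
        os.link(a, link)
        return (classify_pair(a, link),
                classify_pair(a, copy),
                classify_pair(a, other))
```

A deduper cares about the "equivalent" case. Turning that pair into a link converts it into the "same object" case, which is a different relationship with different write semantics.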

Fabian 'ryg' Giesen said...

And now for the opinion part: Links are a giant mess absolutely everywhere, not just because of non-intuitive semantics (where present), but also because most programs think that file names are unique identifiers and all kinds of things start to break in weird ways once that's not true anymore.

In essence, everything to do with links (*especially* on Windows) falls into the crack between what most programs think the FS semantics are, and what they actually are. (See also: NTFS streams, junction points)

By contrast, dedup is a fairly safe thing, simply because it doesn't affect observable behavior from the app side (other than second-order stuff like how much space is free after creating a new file, but with FS compression that's fuzzy anyway).

Anonymous said...

I recently found another problem with soft links -- also known as junctions. If you have a soft link to another drive then Windows Backup will fail. Yep, fail. With a cryptic hexadecimal error.

I hit this after upgrading from a big hard drive to a small boot SSD plus big hard drive. The obvious thing to do is have c:\users\public\music point to f:\public\music, so no configurations need to be updated. This worked perfectly, except that backups failed.

Damn.

cbloom said...

A very common problem with directory links is any time you have a loop, lots of apps will infinitely chase that loop.
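A sketch of one way to walk a tree without chasing such a loop: remember the identity of each real directory and refuse to enter it twice. The function name is made up, and it assumes os.stat reports a usable (st_dev, st_ino) pair for directories on the file system in question.

```python
import os
import tempfile

def walk_no_loops(root):
    """Yield directory paths under root, following links but entering
    each real directory only once."""
    seen = set()
    stack = [root]
    while stack:
        path = stack.pop()
        st = os.stat(path)               # follows links to the real directory
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue                     # already visited under another name
        seen.add(key)
        yield path
        with os.scandir(path) as it:
            for entry in it:
                if entry.is_dir():       # follows links by default
                    stack.append(entry.path)

def demo_loop():
    """Build root/sub/loop -> root and count the directories visited."""
    with tempfile.TemporaryDirectory() as root:
        sub = os.path.join(root, "sub")
        os.mkdir(sub)
        os.symlink(root, os.path.join(sub, "loop"), target_is_directory=True)
        return len(list(walk_no_loops(root)))
```

The walk visits root and sub, then skips the loop link because it resolves to a directory that was already seen.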

SteveP said...

Charles,

Here are my notes on this topic. I have to refer to them every once in a while as I try to forget them as soon as possible:

LINK notes:

HardLinks are alternate names for a file (introduced in Windows NT4, NTFS 1.2/4.0)
Junctions are a link to another directory (introduced in Windows 2000, NTFS 3/5.0)
SymLinks are links to directories/files (introduced in Windows Vista, NTFS 3.1/6.0)

Junctions:
Implemented as a Reparse Point
Can link to another volume but not a remote share

SymLinks:
Implemented as a Reparse Point
Created with CreateSymbolicLink (requires privilege)
Can span volumes and be used on remote shares


HardLink:
Created with CreateHardLink (requires privilege)
Cannot span volumes
GetFileInformationByHandle can be used to get a link count
FindFirstFileNameW / FindNextFileNameW can enumerate link names (Vista+)
DeleteFile decrements the link count and removes the data at 0


Reparse Points
GetVolumeInformation bit for FILE_SUPPORTS_REPARSE_POINTS flag
Use DeviceIoControl and FSCTL_ operations for SET/GET/DELETE
GetFileAttributes bit is FILE_ATTRIBUTE_REPARSE_POINT
FindFirstFile can be used to determine the kind of Reparse Point (dwReserved0)
CreateFile with FILE_FLAG_OPEN_REPARSE_POINT can open directly
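As a rough illustration of the attribute check in these notes, Python exposes the Win32 attribute bits through os.lstat on Windows. This sketch assumes that field; on other platforms it is absent and the function simply reports False. The constant is defined locally and the function name is made up.

```python
import os

# Win32 value of FILE_ATTRIBUTE_REPARSE_POINT, defined here so the
# sketch does not depend on where Python happens to expose it.
FILE_ATTRIBUTE_REPARSE_POINT = 0x400

def is_reparse_point(path):
    """True if path itself (not its target) carries the reparse point attribute."""
    st = os.lstat(path)                           # does not follow the link
    attrs = getattr(st, "st_file_attributes", 0)  # only present on Windows
    return bool(attrs & FILE_ATTRIBUTE_REPARSE_POINT)
```

This only says that some reparse point is present. Telling a junction from a symlink still needs the reparse tag, as in the FindFirstFile note above.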
