Which filepath-join behavior is implemented for relative and absolute paths as arguments?

✍️ Written on 2024-05-03 in 3109 words.
Part of reflection cs software-development programming-languages

Update 2024-05-04T2202: ruby was accidentally assigned to python’s behavior set. Thanks to Karl for pointing it out!

Motivation

Many systems provide functionality to join file paths. Shells and filesystem APIs in particular make it accessible to the user. So if we join foo and bar, we want to get foo/bar on a UNIX system. One should not implement this behavior with a common string library, because joining foo/ with bar should still give foo/bar and not foo//bar. At the same time, we need to take operating-system-specific behavior into account. Win32 does not use a slash, but a backslash to separate file path components. Furthermore, UNIX calls names starting with a slash “absolute filepaths”, whereas Windows uses drive-letter prefixes like C:\ or Universal Naming Convention (UNC) prefixes like \\server\share. A library joining filepaths should handle absolute filepaths according to the filesystem in question.
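To make the double-slash pitfall concrete, here is a minimal Python sketch comparing plain string concatenation with the standard library's posixpath.join (the POSIX flavour of os.path.join, used here so the result does not depend on the host OS). The helper name naive_join is made up for this illustration.

```python
import posixpath


def naive_join(a: str, b: str) -> str:
    """Hypothetical helper: plain string concatenation with a separator."""
    return a + "/" + b


print(naive_join("foo/", "bar"))      # foo//bar
print(posixpath.join("foo/", "bar"))  # foo/bar
print(posixpath.join("foo", "bar"))   # foo/bar
```

The library call only inserts a separator when the left-hand side does not already end in one.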

Now I got interested in the question: what should the behavior of foo joining /bar be?

The motivating behavior

golang implements the following behavior:

package main

import (
    "fmt"
    "path/filepath"
)

func main() {
    fmt.Println(filepath.Join("foobar", "/etc/passwd"))
    // gives "foobar/etc/passwd"
}

python implements the following behavior:

import os.path
print(os.path.join('foobar', '/etc/passwd'))
# gives "/etc/passwd"

And this behavior is confusing to many users:

Independent of preferences, one needs to consider actual vulnerabilities emerging. Indeed, there are two recent ones motivating this evaluation:

Evaluation per programming language

rust implements python’s behavior:

fn main() {
    // no `mut` needed: join returns a new PathBuf and leaves `p` untouched
    let p = std::path::Path::new("foobar");
    println!("{}", p.join("/etc/passwd").display());
    // gives "/etc/passwd"
}

C++'s filesystem API (since C++17) implements python’s behavior and refers to POSIX:

std::filesystem::path("foobar") / "/etc/passwd";
// the result is "/etc/passwd" (replaces)

Java declares the behavior as “provider specific” and thus the situation remains unclear, because I don’t have a JVM at hand to try it out.

Path.Combine on .NET implements python’s behavior:

“If the one of the subsequent paths is an absolute path, then the combine operation resets starting with that absolute path, discarding all previous combined paths.”

Dart & Flutter provide python’s behavior. D implements python’s semantics as well, as documented by “If any of the path segments are absolute (as defined by isAbsolute), the preceding segments will be dropped”.

buildPath("/foo", "/bar")
// /bar

Tcl file join is also on python’s side: “If a name is an absolute path, all previous arguments are discarded and any subsequent arguments are then joined to it.”

And then finally, we have more supporters of golang’s behavior:

  • ruby implements golang’s behavior:

    p File.join("foobar", "/etc/passwd")
    #=> "foobar/etc/passwd"
  • Nim with joinPath("usr/", "/lib") as "usr/lib"

  • FreePascal fails to mention the implemented behavior in ConcatPaths, but I tried it out and it sides with golang. Unlike other APIs, it specifically provides functions such as ExcludeLeadingPathDelimiter.

  • PowerShell’s join-path implements golang’s behavior:

  • zig sides with golang as well
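For comparison with the python examples, golang’s behavior can be approximated in a few lines of Python. This is only a sketch under the assumption of POSIX-style paths; go_style_join is a hypothetical name, and posixpath.normpath is not identical to Go's Clean (for instance, normpath keeps a leading double slash).

```python
import posixpath


def go_style_join(*parts: str) -> str:
    """Approximate Go's filepath.Join on a POSIX system (illustrative only)."""
    non_empty = [p for p in parts if p]  # empty elements are ignored
    if not non_empty:
        return ""
    # Concatenate with "/" and let normpath collapse repeated slashes,
    # so a leading "/" on a later element does not reset the result.
    return posixpath.normpath("/".join(non_empty))


print(go_style_join("foobar", "/etc/passwd"))  # foobar/etc/passwd
print(go_style_join("usr/", "/lib"))           # usr/lib
```

The design choice is visible in the body: the arguments are treated as strings that are appended and then normalized, so an absolute argument has no special meaning.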

Where does this behavior come from?

When I heard of it, I thought that these are common POSIX semantics. I was wrong. These strings are not passed down some API, but the behavior is implemented by programming languages:

And I was not able to reproduce the behavior in libc/syscalls. First, I tried chdir:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/limits.h>

int main(int argc, char* argv[])
{
    chdir("foobar//etc");

    char cwd[PATH_MAX];
    if (getcwd(cwd, sizeof(cwd)) != NULL) {
        printf("Current working dir: %s\n", cwd);
    } else {
        return 1;
    }

    return 0;
}

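It prints Current working dir: /tmp when it is run inside /tmp, so the chdir call failed and the working directory is unchanged. The path is not reinterpreted as /etc: POSIX path resolution treats repeated slashes like a single one, so it looks for foobar/etc below /tmp. The same is true for fopen: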

fopen("main.c//tmp/main.go", "r")

… returns NULL. Maybe I need to use less libc and more POSIX:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char* argv[])
{
    /* O_EXCL without O_CREAT is undefined; a plain read-only open is enough here */
    int fd = open("main.c//tmp/main.go", O_RDONLY);
    printf("%d\n", fd);
    return 0;
}

… but also this pure open example prints -1 indicating an error. Finally realpath also returns NULL:

#include <stdio.h>
#include <fcntl.h>
#include <limits.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    char result[PATH_MAX];
    char *ret = realpath("main.c//tmp/main.go", result);
    printf("%p\n", ret);
    return 0;
}

In the end, POSIX specifies how to traverse/resolve a filepath, but there is no functionality to join them. Thus, there was likely no necessity to specify this behavior.
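The resolution rule can be checked from Python without any join function involved. The following sketch assumes a POSIX system; the helper name double_slash_is_same_dir is made up. It creates a directory and asks the operating system whether the single-slash and double-slash spellings refer to the same file.

```python
import os
import tempfile


def double_slash_is_same_dir(base: str) -> bool:
    """Create base/foobar/etc and compare it with the 'foobar//etc' spelling."""
    os.makedirs(os.path.join(base, "foobar", "etc"))
    single = base + "/foobar/etc"
    doubled = base + "/foobar//etc"
    # samefile() stats both paths, so the OS does the resolution, not Python.
    return os.path.samefile(single, doubled)


with tempfile.TemporaryDirectory() as tmp:
    print(double_slash_is_same_dir(tmp))  # True on a POSIX system
```

So a slash in the middle of a path never means "restart at the root" at the syscall level; that meaning only exists in the join functions of some languages.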

If POSIX is not responsible and programming languages like python and rust implement it themselves, when did it start? When did python implement the behavior first?

  • python 0.9.1 did not yet have a path.join function.

  • python 1.2 provides path.join (in posixpath.py) with the following implementation mentioning the behavior explicitly:

    # Join two pathnames.
    # Ignore the first part if the second part is absolute.
    # Insert a '/' unless the first part is empty or already ends in '/'.
    
    def join(a, b):
            if b[:1] == '/': return b
            if a == '' or a[-1:] == '/': return a + b
            # Note: join('x', '') returns 'x/'; is this what we want?
            return a + '/' + b
  • python 1.5.2 accepts a variadic number of arguments and continues implementing this behavior:

    # Join pathnames.
    # Ignore the previous parts if a part is absolute.
    # Insert a '/' unless the first part is empty or already ends in '/'.
    
    def join(a, *p):
        """Join two or more pathname components, inserting '/' as needed"""
        path = a
        for b in p:
            if b[:1] == '/':
                path = b
            elif path == '' or path[-1:] == '/':
                path = path + b
            else:
                path = path + '/' + b
        return path

Ok, so we know that python 1.2 already had this behavior. Does it explain why? No.
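The rule stated in those old comments can still be observed in a current Python 3, here via posixpath so that the result is the same on every host OS:

```python
import posixpath

# An absolute component discards everything before it.
print(posixpath.join("foobar", "/etc/passwd"))      # /etc/passwd
print(posixpath.join("foobar", "/etc", "passwd"))   # /etc/passwd

# The case questioned in the 1.2 source comment.
print(posixpath.join("x", ""))                      # x/
```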

By the way, PEP 428 from 2012 introduced an object-oriented API for filesystem paths in python. Did they change the behavior?

from pathlib import Path
Path('foobar') / '/etc/passwd'
# gives "PosixPath('/etc/passwd')"

No, but simultaneously, a different behavior can easily be achieved:

from pathlib import Path
child = Path('/etc/passwd')
Path('foobar') / child.relative_to(child.anchor)
# gives "PosixPath('foobar/etc/passwd')"

The question of expected behavior

What is the expected behavior in the end? In 2014 in rust issue 16507, a user writes:

I would expect that a.join(&b) would return /foo/bar, however it returns /bar. Given my experience w/ path joining in Ruby and Go, I would expect that join concats two paths and does some normalization to remove double slashes, etc…

The user got 22 thumbs-ups for this initial issue description. lilyball counters:

I agree with @aturon. The only sensible operation when joining an absolute path onto some other path is to get the absolute path back. Doing anything else is just weird, and only makes sense if you actually think of paths as strings, where "join" is "append, then normalize". I do not understand why Go’s path.Join behaves in this way, although they are actually taking strings as arguments.

The C++ community also seemed to be divided, because diverging arguments have been raised during the definition process of C++17. In the end python’s behavior was still implemented for POSIX systems. First, some arguments in favor of python’s behavior were raised (2014):

This means that, for example, "c:\x" / "d:\y" gives "c:\x\d:\y", and that "c:\x" / "\\server\share" gives "c:\x\\server\share". This is rarely, if ever, useful.

An alternative interpretation of p1 / p2 could be that it yields a path that is the approximation of what p2 would mean if interpreted in an environment in which p1 is the starting directory. Under this interpretation, "c:\x" / "d:\y" gives "d:\y", which is more likely to match what was intended.
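Python's Windows flavour of os.path.join (ntpath.join, importable on any OS) happens to follow this second interpretation, which makes it easy to try out. This is only an illustration in Python, not the C++ API under discussion.

```python
import ntpath

# A different drive replaces the left-hand side entirely.
print(ntpath.join(r"c:\x", r"d:\y"))  # d:\y

# A rooted path without a drive keeps the drive of the left-hand side.
print(ntpath.join(r"c:\x", r"\y"))    # c:\y

# A plain relative path is appended.
print(ntpath.join(r"c:\x", "y"))      # c:\x\y
```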

Later the opposite behavior was suggested and formalized (2017):

“Passing a path that includes a root path (name or directory) to path.operator/=() simply incorporates the root path into the middle of the result, changing its meaning drastically. LWG 2664 proposes disallowing a path with a root name (but leaves the root directory possibility untouched); US 77/CA 6 (via P0430R0) objects and suggests instead making the same case implementation-defined. (P0430R1 drops the matter in favor of this issue.)”

[…]

// On POSIX,
path("foo") / "";     // yields "foo/"
path("foo") / "/bar"; // yields "/bar"
// On Windows, backslashes replace slashes in the above yields

Let us summarize the statistical data from above:

behavior    count       example output
python’s    6           /etc/passwd
golang’s    6           foobar/etc/passwd
unclear     1 (Java)

My personal opinion on this is the following:

  • I think joining means taking equivalent elements and concatenating them to work together (cf. python’s str.join)

  • The fundamental problem is that foo/bar and /foo/bar are not equivalent elements at all. A relative and an absolute filepath carry different semantics. A relative filepath refers to different elements depending on your current location, whereas an absolute filepath is fixed. In terms of type systems, one might want to model them as two different types (because different operations can be done).

  • Joining a relative and an absolute path is a hazard, because the absolute path dictates “start here”. Since joining happens from left-to-right (in our LTR writing systems), python’s behavior makes sense and corresponds to the semantics of relative/absolute file paths.

  • Never trust user input! If you actually allow users to specify file paths (request URLs in webservers, arguments in command line tools to fetch data from production systems), you need to verify which arguments are allowed and check that. Certainly the standard library should help you with it, but always read the corresponding documentation to match your expectations with reality.

  • Apparently, CVEs have appeared and people actually get it wrong. This is certainly an argument in favor of golang’s behavior. However, whereas I consider C:\Windows join D:\Media resulting in D:\Media surprising, I personally consider C:\Windows\Media arbitrary. In the end, I think an error is the only way to go.

  • I think we just get the idea of file paths wrong. We treat relative and absolute filepaths as equivalent even though they are not. Maybe shells gave us the wrong idea that everything is just a string anyway.
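The "error is the only way to go" position can be sketched in Python as follows. The function strict_join is hypothetical (no standard library offers it under this name) and assumes POSIX-style paths via PurePosixPath.

```python
from pathlib import PurePosixPath


def strict_join(base: str, *parts: str) -> PurePosixPath:
    """Hypothetical join that refuses absolute components after the first."""
    result = PurePosixPath(base)
    for part in parts:
        if PurePosixPath(part).is_absolute():
            raise ValueError(f"refusing to join absolute path {part!r}")
        result = result / part
    return result


print(strict_join("foobar", "etc/passwd"))  # foobar/etc/passwd
try:
    strict_join("foobar", "/etc/passwd")
except ValueError as err:
    print("error:", err)
```

With this variant neither python’s nor golang’s interpretation is chosen silently; the caller has to decide what an absolute argument should mean.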

BTW, werkzeug (and thus flask utilizing it) has built its own safe_join function.
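The general idea behind such a helper can be sketched like this. It is not werkzeug's implementation; safe_child is a hypothetical name and the check shown (resolve, then verify containment) is just one possible approach.

```python
import os


def safe_child(base: str, untrusted: str) -> str:
    """Hypothetical helper: join, resolve, and verify the result stays in base."""
    base_real = os.path.realpath(base)
    # os.path.join may discard base_real if `untrusted` is absolute;
    # the containment check below catches that case as well as "..".
    candidate = os.path.realpath(os.path.join(base_real, untrusted))
    if os.path.commonpath([base_real, candidate]) != base_real:
        raise ValueError(f"{untrusted!r} escapes {base!r}")
    return candidate
```

Called with a base directory and "a/b.txt" it returns the resolved path below the base, while "/etc/passwd" or "../x" raise ValueError.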

Conclusion

My original intention for investigating this topic was to determine who came up with this behavior originally. I was convinced it comes from the POSIX world, but I could not find any supporting evidence. I was wrong. The idea seems to come from the early days of programming languages (before 1990) and the actual origin remains unclear to me.

Thus I designed the blog article around the question “which behavior is implemented?” and also “what is the expected behavior?”. And everything, except throwing an error, seems insane to me now.