Ruby's String Slice is Broken
Luke Jahnke, 4 November 2024

I recently discovered a couple of interesting bugs within Ruby's implementation of extracting substrings. The bugs result in incorrect and surprising return values that could violate assumptions made by Ruby application developers, potentially leading to security vulnerabilities.

For background, Ruby provides substring retrieval with the String#slice method, which is aliased as String#[]. The method accepts an index argument and an optional length argument. This post focuses only on the case where Integer objects are supplied, although Range, Regexp and String arguments are also accepted. There is also String#slice!, which returns the extracted substring and additionally mutates the original string by removing that substring. All of these methods, as well as String#size, operate on characters. This is in contrast with String#byteslice and String#bytesize, which operate on bytes. Further details and examples can be found in the Ruby documentation for String.
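As a quick illustration of the character versus byte distinction, here is a minimal sketch using a string that contains one two-byte UTF-8 character. The variable names are only for illustration.

```ruby
s = "h\u00e9llo"             # "héllo": 5 characters, 6 bytes in UTF-8

s.slice(1, 2)                # "él" (index and length count characters)
s[1, 2]                      # same result through the String#[] alias
s.size                       # 5
s.bytesize                   # 6
s.byteslice(1, 2)            # "é" (index and length count bytes)

s.slice(5, 1)                # "" (an index equal to size gives an empty string)
s.slice(6, 1)                # nil (index past the end)
s.slice(-6, 1)               # nil (index before the start)

copy = s.dup
removed = copy.slice!(1, 2)  # returns "él" and leaves copy as "hlo"
```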

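The first bug I have found violates the assumption that nil is returned for all indexes outside of -size to size (when a length is supplied, an index equal to size returns an empty string, and anything further out is expected to return nil). To observe that this assumption is false, you can run the following commands: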

irb(main):001:0> ("\x80" + "A"*7).slice(-9223372036854775806, 1)
=> nil

irb(main):002:0> ("\x80" + "A"*7).slice(-9223372036854775807, 1)
=> nil

irb(main):003:0> ("\x80" + "A"*7).slice(-9223372036854775808, 1)
=> ""

The difference between returning nil and an empty string may not seem significant, but nil and false are the only falsey values in Ruby. This means empty strings (and even the Integer 0) are truthy in Ruby.

irb(main):001:0> !!nil
=> false

irb(main):002:0> !!""
=> true
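To make the consequence concrete, here is a minimal sketch of the kind of guard that depends on nil. The helper char_at is hypothetical and not taken from any real application.

```ruby
# Hypothetical lookup helper: callers rely on nil to detect "no such character".
def char_at(str, index)
  str.slice(index, 1) || :out_of_range
end

char_at("abc", 1)    # "b"
char_at("abc", 10)   # :out_of_range, because slice returned nil

# An empty string is truthy, so it passes straight through `||`.
empty = ""
empty || :out_of_range   # ""
```

If slice unexpectedly returns "" where nil was expected, the fallback branch in a helper like this is never taken.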


The second bug I have found violates the assumption that the length argument restricts the maximum size of the returned string. To observe that this assumption is false, you can run the following commands:

irb(main):001:0> ("A"*37 + "\x80").slice(-1, 1).size
=> 1

irb(main):002:0> ("A"*38 + "\x80").slice(-1, 1).size
=> 1

irb(main):003:0> ("A"*39 + "\x80").slice(-1, 1).size
=> 2

To understand why these bugs exist, we have to look at string.c within the Ruby codebase. The relevant code is shown below, reordered so that it reads more clearly from top to bottom.

void
Init_String(void)
{
[...]
    rb_define_method(rb_cString, "slice", rb_str_aref_m, -1);
[...]
}

static VALUE
rb_str_aref_m(int argc, VALUE *argv, VALUE str)
{
[...]
            long beg = NUM2LONG(argv[0]);
            long len = NUM2LONG(argv[1]);
            return rb_str_substr(str, beg, len);
[...]
}

VALUE
rb_str_substr(VALUE str, long beg, long len)
{
    return str_substr(str, beg, len, TRUE);
}

static VALUE
str_substr(VALUE str, long beg, long len, int empty)
{
    char *p = rb_str_subpos(str, beg, &len);

    if (!p) return Qnil;
    if (!len && !empty) return Qnil;
[...]
}

char *
rb_str_subpos(VALUE str, long beg, long *lenp)
{
[...]
    long len = *lenp;
[...]
    long blen = RSTRING_LEN(str);
    rb_encoding *enc = STR_ENC_GET(str);
    char *p, *s = RSTRING_PTR(str), *e = s + blen;
[...]
    if (len < 0) return 0;
[...]
    if (single_byte_optimizable(str)) {
[...]
            if (beg < 0) return 0;
[...]
    }
[...]
    if (beg < 0) {
        if (len > -beg) len = -beg;
        if (-beg * rb_enc_mbmaxlen(enc) < RSTRING_LEN(str) / 8) {
            beg = -beg;
            while (beg-- > len && (e = rb_enc_prev_char(s, e, e, enc)) != 0);
            p = e;
            if (!p) return 0;
            while (len-- > 0 && (p = rb_enc_prev_char(s, p, e, enc)) != 0);
            if (!p) return 0;
            len = e - p;
            goto end;
[...]
  end:
    *lenp = len;
    RB_GC_GUARD(str);
    return p;
}


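The trigger of ("\x80" + "A"*7).slice(-9223372036854775808, 1) for the first bug can be dissected as follows. The leading "\x80" byte is invalid UTF-8, so the string is presumably not single-byte optimizable and execution reaches the beg < 0 branch. On a platform with a 64-bit long, beg is LONG_MIN and -beg overflows; in practice it wraps back to LONG_MIN. The check len > -beg is therefore true and len is clamped to LONG_MIN. Assuming rb_enc_mbmaxlen returns 4 for UTF-8, -beg * rb_enc_mbmaxlen(enc) wraps to 0, which is less than RSTRING_LEN(str) / 8 = 1 for this 8-byte string, so the inner branch is taken. Both while loops exit immediately because beg is not greater than len and len is not greater than 0, leaving p = e and len = e - p = 0. A non-NULL pointer with a length of 0 is returned, and str_substr turns that into an empty string rather than nil.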
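The integer arithmetic behind this trigger can be checked with a small Ruby model. This is only a sketch: wrap64 and fast_path_inputs are hypothetical helpers, the value 4 for rb_enc_mbmaxlen under UTF-8 is an assumption, and wrapping on overflow is what common 64-bit builds do in practice rather than something C guarantees.

```ruby
# Hypothetical helper: reduce an Integer to a signed 64-bit value, modelling
# how an overflowing C `long` wraps in practice (assumption: 64-bit long;
# signed overflow is formally undefined behaviour in C).
def wrap64(n)
  n &= 0xFFFF_FFFF_FFFF_FFFF
  n >= 2**63 ? n - 2**64 : n
end

LONG_MIN = -2**63

# Mirrors the first steps of the `beg < 0` branch in the excerpt above.
# mbmaxlen: 4 is an assumed value for UTF-8.
def fast_path_inputs(beg, len, byte_len, mbmaxlen: 4)
  neg_beg = wrap64(-beg)                # -beg
  len = neg_beg if len > neg_beg        # if (len > -beg) len = -beg;
  product = wrap64(neg_beg * mbmaxlen)  # -beg * rb_enc_mbmaxlen(enc)
  { neg_beg: neg_beg, len: len, product: product,
    fast_path: product < byte_len / 8 }
end

p fast_path_inputs(LONG_MIN, 1, 8)      # beg = -9223372036854775808
p fast_path_inputs(LONG_MIN + 1, 1, 8)  # beg = -9223372036854775807
```

In the first call both neg_beg and len come out as LONG_MIN and the product is 0, so neither while loop in the excerpt would run and a zero-length result is produced. In the second call neg_beg is LONG_MAX and len stays 1, so the first loop keeps stepping backwards until it runs out of string, which yields nil.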

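The trigger of ("A"*39 + "\x80").slice(-1, 1) for the second bug can be dissected as follows. The string is 40 bytes long, so RSTRING_LEN(str) / 8 is 5 and, assuming rb_enc_mbmaxlen returns 4 for UTF-8, the check 1 * 4 < 5 passes; with one fewer "A" it would be 4 < 4 and this branch would be skipped. The first while loop does not run because beg equals len. The second loop calls rb_enc_prev_char once, which appears to step backwards over UTF-8 continuation bytes until it reaches a byte that could start a character. The trailing "\x80" is a stray continuation byte, so the search moves past it and stops at the preceding "A". len is then set to e - p = 2 bytes, and both the "A" and the invalid byte are returned.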
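The backwards search in the excerpt can be modelled in plain Ruby to see where the extra bytes come from. This is a rough sketch, not Ruby's actual implementation: prev_char_head is a hypothetical stand-in for rb_enc_prev_char under UTF-8, fast_path_byte_len is a hypothetical name, the value 4 for rb_enc_mbmaxlen is an assumption, and the single_byte_optimizable shortcut is not modelled.

```ruby
# Hypothetical stand-in for rb_enc_prev_char on UTF-8 data: move one byte
# left, then keep moving left while on a continuation byte (10xxxxxx).
def prev_char_head(bytes, pos)
  return nil if pos <= 0
  i = pos - 1
  i -= 1 while i > 0 && (bytes[i] & 0xC0) == 0x80
  i
end

# Returns the byte length the modelled branch would report, :slow_path when
# the size check sends the call elsewhere, or nil when the search runs out.
def fast_path_byte_len(str, beg, len, mbmaxlen: 4)
  bytes = str.bytes
  tail = bytes.size
  len = -beg if len > -beg
  return :slow_path unless -beg * mbmaxlen < bytes.size / 8

  beg = -beg
  while beg > len                 # skip characters after the wanted range
    beg -= 1
    tail = prev_char_head(bytes, tail)
    return nil if tail.nil?
  end
  pos = tail
  while len > 0                   # step back over the wanted characters
    len -= 1
    pos = prev_char_head(bytes, pos)
    return nil if pos.nil?
  end
  tail - pos
end

fast_path_byte_len("A" * 38 + "\x80", -1, 1)  # :slow_path (39 bytes)
fast_path_byte_len("A" * 39 + "\x80", -1, 1)  # 2
fast_path_byte_len("\xa0" * 100, -1, 1)       # 100
```

A single backwards step swallows every trailing continuation byte, which is why the model reports more than one byte even though only one character was requested.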

It is also worth mentioning that the delta between the length argument and the size of what is actually returned is not restricted to 1 and can instead be arbitrarily large. The following shows a length argument of 1 but a string of size 100 being returned:

irb(main):001:0> ("\xa0" * 39).slice(-1, 1).size
=> 1

irb(main):002:0> ("\xa0" * 100).slice(-1, 1).size
=> 100

I have confirmed that both bugs can be triggered on versions 2.0.0 up to and including the latest versions (3.0.7, 3.1.5, 3.2.4, 3.3.1). Even older versions have similar implementations and are likely buggy in similar ways.

When triggering either bug I used strings with a UTF-8 encoding, which can be checked with the String#encoding method. The encoding can be controlled in many different ways, such as using the encoding directive magic comment, calling .force_encoding("utf-8") on a string object or setting the locale. Since commit 8e49e25b59 in June 2020, the official Docker images have explicitly set the locale to a UTF-8 encoding.
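As a small sketch of how to inspect and pin the encoding of a test string, the following builds the string from the first example and forces it to UTF-8 so the result does not depend on locale or source-file settings. The variable names are illustrative only.

```ruby
s = ("\x80" + "A" * 7).force_encoding("utf-8")

s.encoding         # Encoding::UTF_8
s.valid_encoding?  # false: "\x80" is a lone continuation byte
s.ascii_only?      # false
s.size             # 8 (the invalid byte counts as one character)
s.bytesize         # 8

plain = ("A" * 8).force_encoding("utf-8")
plain.ascii_only?  # true
```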

Finally, the following Dockerfile can be used to compare branch coverage within string.c for two different Ruby one-liners.

FROM fedora:39

RUN dnf update -y && \
  dnf install -y '@Development tools' redhat-rpm-config autoconf \
    openssl-devel zlib-devel libyaml ruby libffi libyaml-devel \
    libffi-devel vim-enhanced && \
  dnf clean all && \
  git clone --branch v3_3_0 --depth 1 https://github.com/ruby/ruby && \
  cd /ruby && \
  ./autogen.sh && \
  ./configure --enable-gcov && \
  make "-j$(nproc)" && \
  cat <<"EOF" > /entrypoint.sh
    cd /ruby
    find . -name \*.gcda -exec rm {} \;
    ./ruby --disable-all -e "$1"
    gcov --stdout -c string.c > /gcov-left
    find . -name \*.gcda -exec rm {} \;
    ./ruby --disable-all -e "$2"
    gcov --stdout -c string.c > /gcov-right
    vimdiff -c 'windo set wrap' /gcov-left /gcov-right
EOF

ENTRYPOINT ["/bin/bash", "/entrypoint.sh"]

It can be used as follows after placing the above in a file named Dockerfile:

$ docker build -t ruby-branch-cov-vimdiff .
$ docker run -it ruby-branch-cov-vimdiff '"a"[2,1]' '"a"[-2,1]'