I recently discovered a couple of interesting bugs within Ruby's implementation of extracting substrings. The bugs result in incorrect and surprising return values that could violate assumptions made by Ruby application developers, potentially leading to security vulnerabilities.
For background, Ruby provides substring retrieval through the String#slice method, which is aliased as String#[]. The method accepts an index argument and an optional length argument. This post focuses only on the case where Integer objects are supplied, although Range, Regexp and String arguments are also accepted. There is also a String#slice! method, which returns the extracted substring and additionally mutates the original string by removing that substring. All of these methods, as well as String#size, operate on characters, in contrast with String#byteslice and String#bytesize, which operate on bytes. Further details and examples can be found in the Ruby documentation for String.
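As a quick illustration of the character versus byte distinction, here is a minimal sketch using a made-up sample string that contains one two-byte character. The variable names are purely illustrative.

```ruby
# Sample string with one multi-byte character (U+00E9 is two bytes in UTF-8).
s = "h\u00e9llo"

puts s.size            # => 5 (characters)
puts s.bytesize        # => 6 (bytes)
p s.slice(1, 3)        # three characters starting at character index 1
p s[1, 3]              # String#[] performs the same operation
p s.byteslice(1, 2)    # two bytes starting at byte index 1

# String#slice! returns the substring and removes it from the receiver.
t = s.dup
removed = t.slice!(0, 2)
p removed              # the two removed characters
p t                    # what is left in the mutated string
```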
The first bug I have found violates the assumption that nil is returned for all indexes outside of -size to size-1. To observe that this assumption is false, you can run the following commands:
irb(main):001:0> ("\x80" + "A"*7).slice(-9223372036854775806, 1)
=> nil
irb(main):002:0> ("\x80" + "A"*7).slice(-9223372036854775807, 1)
=> nil
irb(main):003:0> ("\x80" + "A"*7).slice(-9223372036854775808, 1)
=> ""
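To check an arbitrary interpreter without hard-coding the index, a small probe along the following lines could be used. This is only a sketch: the helper name probe_near_long_min is hypothetical, and it assumes the width of the C long can be inferred from the native-long pack directive. It prints whatever the running Ruby returns rather than assuming a particular result.

```ruby
# Hypothetical helper: slices at LONG_MIN + 2, LONG_MIN + 1 and LONG_MIN.
def probe_near_long_min(str)
  # Width of the platform's C long, derived from the native-long pack directive.
  long_bits = [0].pack("l!").bytesize * 8
  long_min  = -(2**(long_bits - 1))

  [long_min + 2, long_min + 1, long_min].map do |index|
    [index, str.slice(index, 1)]
  end
end

probe_near_long_min("\x80" + "A" * 7).each do |index, result|
  puts "ruby #{RUBY_VERSION}: slice(#{index}, 1) => #{result.inspect}"
end
```

On an affected interpreter the last line printed shows an empty string instead of nil.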
The difference between returning nil and an empty string may not seem significant, but nil and false are the only falsey values in Ruby. This means empty strings (and even the Integer 0) are truthy in Ruby.
irb(main):001:0> !!nil
=> false
irb(main):002:0> !!""
=> true
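To make the consequence concrete, the sketch below contrasts two hypothetical helpers an application might write to decide whether an index was in range. Neither helper comes from Ruby itself; they only illustrate how a truthiness check treats an unexpected empty string.

```ruby
# Hypothetical helper: treats any truthy slice result as "index was in range".
def in_range_by_truthiness?(substring)
  substring ? true : false
end

# Stricter hypothetical helper: also rejects the empty string.
def in_range_strict?(substring)
  !substring.nil? && !substring.empty?
end

p in_range_by_truthiness?(nil)  # => false
p in_range_by_truthiness?("")   # => true, the empty string is truthy
p in_range_strict?("")          # => false
```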
The second bug I have found violates the assumption that the length argument restricts the maximum size of the returned string. To observe that this assumption is false, you can run the following commands:
irb(main):001:0> ("A"*37 + "\x80").slice(-1, 1).size
=> 1
irb(main):002:0> ("A"*38 + "\x80").slice(-1, 1).size
=> 1
irb(main):003:0> ("A"*39 + "\x80").slice(-1, 1).size
=> 2
To understand how these bugs exist we have to look at string.c within the Ruby codebase. The relevant code is shown below, reordered to be clearer when read top to bottom.
void
Init_String(void)
{
    [...]
    rb_define_method(rb_cString, "slice", rb_str_aref_m, -1);
    [...]
}

static VALUE
rb_str_aref_m(int argc, VALUE *argv, VALUE str)
{
    [...]
    long beg = NUM2LONG(argv[0]);
    long len = NUM2LONG(argv[1]);
    return rb_str_substr(str, beg, len);
    [...]
}

VALUE
rb_str_substr(VALUE str, long beg, long len)
{
    return str_substr(str, beg, len, TRUE);
}

static VALUE
str_substr(VALUE str, long beg, long len, int empty)
{
    char *p = rb_str_subpos(str, beg, &len);
    if (!p) return Qnil;
    if (!len && !empty) return Qnil;
    [...]
}

char *
rb_str_subpos(VALUE str, long beg, long *lenp)
{
    [...]
    long len = *lenp;
    [...]
    long blen = RSTRING_LEN(str);
    rb_encoding *enc = STR_ENC_GET(str);
    char *p, *s = RSTRING_PTR(str), *e = s + blen;
    [...]
    if (len < 0) return 0;
    [...]
    if (single_byte_optimizable(str)) {
        [...]
        if (beg < 0) return 0;
        [...]
    }
    [...]
    if (beg < 0) {
        if (len > -beg) len = -beg;
        if (-beg * rb_enc_mbmaxlen(enc) < RSTRING_LEN(str) / 8) {
            beg = -beg;
            while (beg-- > len && (e = rb_enc_prev_char(s, e, e, enc)) != 0);
            p = e;
            if (!p) return 0;
            while (len-- > 0 && (p = rb_enc_prev_char(s, p, e, enc)) != 0);
            if (!p) return 0;
            len = e - p;
            goto end;
    [...]
  end:
    *lenp = len;
    RB_GC_GUARD(str);
    return p;
}
The trigger of ("\x80" + "A"*7).slice(-9223372036854775808, 1) for the first bug can be dissected as follows:
"\x80" is used to ensure the condition if (single_byte_optimizable(str)) { is not taken. As long as the string contains at least a single byte with the high bit set, so "\x80" to "\xff", the if condition will not be taken.
"\x80"
if (single_byte_optimizable(str)) {
"\xff"
"A"*7 is required to ensure the condition if (-beg * rb_enc_mbmaxlen(enc) < RSTRING_LEN(str) / 8) is taken. The left hand side will be 0 so we need to ensure the right hand side is at least 1, which an extra 7 characters is the lowest that achieves this.
"A"*7
if (-beg * rb_enc_mbmaxlen(enc) < RSTRING_LEN(str) / 8)
-9223372036854775808 being a negative value ensures the if (beg < 0) { condition is taken. The value needs to be LONG_MIN for the system and compiler. It is chosen because negating it still yields a negative number, which affects all three instances of -beg. As a result, len unexpectedly ends up with a negative value, and rb_enc_prev_char is never called in either of the while conditions. Had it been called, it would have failed and triggered an early return 0. As p is initialised to e and never adjusted by rb_enc_prev_char, len gets set to 0 by len = e - p;. This len of 0 ultimately causes the empty string to be returned.
The explicit length argument to slice is required. Without it, an alternate path to str_substr is taken via rb_str_aref, which sets the empty argument to FALSE and so causes str_substr to return nil.
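The arithmetic in the dissection above can be modelled in Ruby, whose integers do not overflow, by wrapping values to 64 bits by hand. The sketch below rests on several assumptions: a 64-bit long, two's complement wraparound on overflow (the C standard leaves signed overflow undefined, so this only mirrors what is observed in practice), and a maximum character length of 4 for UTF-8. The names wrap64 and branch_inputs are hypothetical and do not exist in Ruby or in string.c.

```ruby
LONG_MIN = -(2**63) # assumes a 64-bit C long

# Hypothetical helper: wrap an Integer to a signed 64-bit value.
def wrap64(n)
  n &= 0xFFFF_FFFF_FFFF_FFFF
  n >= 2**63 ? n - 2**64 : n
end

# Hypothetical helper mirroring the first lines of the `if (beg < 0)` block.
def branch_inputs(beg, len, byte_len, mbmaxlen = 4)
  neg_beg = wrap64(-beg)               # models -beg
  len = neg_beg if len > neg_beg       # models: if (len > -beg) len = -beg;
  taken = wrap64(neg_beg * mbmaxlen) < byte_len / 8
  { neg_beg: neg_beg, len: len, branch_taken: taken }
end

# First trigger: 8-byte string, index LONG_MIN, length 1.
# -beg stays at LONG_MIN, len becomes negative, and the branch is taken.
p branch_inputs(LONG_MIN, 1, 8)

# An ordinary index of -1 on the same string does not take the branch.
p branch_inputs(-1, 1, 8)
```

With len negative and the negated beg equal to len, neither while loop in the C code runs its body, which matches the explanation of why p stays equal to e.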
The trigger of ("A"*39 + "\x80").slice(-1, 1) for the second bug can be dissected as follows:
"A"*39 is required to ensure the condition if (-beg * rb_enc_mbmaxlen(enc) < RSTRING_LEN(str) / 8) { is taken. The left hand side will be 4 with an index of -1, requiring the right hand side to be at least 5, which an extra 39 characters being the lowest that achieves this.
"A"*39
if (-beg * rb_enc_mbmaxlen(enc) < RSTRING_LEN(str) / 8) {
"\x80" is used to once again ensure the condition if (single_byte_optimizable(str)) { is not taken. An extra restriction applies for this bug, where the byte must also be false for utf8_islead in enc/utf_8.c. This restricts the byte to having the high bit set, but the second highest bit not set, meaning values from "\x80" to "\xbf" are suitable. The byte must also be positioned at the corresponding position of the index.
A negative index is required for the condition if (beg < 0) { to be taken. As the index becomes more negative, the size of the string must also increase. An index of -2 will only return more than one character when the string is at least 72 characters in length.
This time the length argument is not required, which means the bug is also reproducible with ("A"*39 + "\x80").slice(-1).
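The two constraints in the dissection above (a minimum string length for a given negative index, and a byte in the "\x80" to "\xbf" range) can be expressed as a short sketch. Both helper names are hypothetical, and the maximum character length of 4 is assumed from the left hand side value of 4 quoted for an index of -1.

```ruby
# Hypothetical helper: smallest byte length for which
#   -beg * mbmaxlen < byte_length / 8   (integer division)
# holds for a negative index.
def min_bytes_for_branch(index, mbmaxlen = 4)
  (index.abs * mbmaxlen + 1) * 8
end

# Hypothetical helper: true for bytes with the high bit set and the
# second highest bit clear, i.e. 0x80 through 0xBF.
def continuation_byte?(byte)
  (byte & 0xC0) == 0x80
end

puts min_bytes_for_branch(-1)  # => 40
puts min_bytes_for_branch(-2)  # => 72

suitable = (0..255).select { |b| continuation_byte?(b) }
p suitable.minmax              # => [128, 191]
```

The value 40 corresponds to the 39 "A" characters plus the single "\x80" byte, and 72 matches the length quoted for an index of -2.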
It is also worth mentioning that the delta between the length argument and the size of what is actually returned is not restricted to 1 and can instead be arbitrarily large. The following shows a length argument of 1 but a string of size 100 being returned:
irb(main):001:0> ("\xa0" * 39).slice(-1, 1).size
=> 1
irb(main):002:0> ("\xa0" * 100).slice(-1, 1).size
=> 100
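A small measurement helper can report how far a result exceeds the requested length on whichever interpreter runs it. This is a sketch only: excess_characters is a hypothetical name, and the script prints the observed values rather than asserting what an affected or unaffected Ruby should return.

```ruby
# Hypothetical helper: number of characters returned beyond the requested length.
def excess_characters(str, index, length)
  result = str.slice(index, length)
  return 0 if result.nil?

  [result.size - length, 0].max
end

samples = {
  '"A"*38 + "\x80"' => "A" * 38 + "\x80",
  '"A"*39 + "\x80"' => "A" * 39 + "\x80",
  '"\xa0" * 100'    => "\xa0" * 100
}

samples.each do |label, str|
  puts "#{label}: #{excess_characters(str, -1, 1)} character(s) over the limit"
end
```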
I have confirmed both bugs can be triggered on versions 2.0.0 up to and including the latest versions (3.0.7, 3.1.5, 3.2.4, 3.3.1). Even older versions have similar implementations and are likely buggy in similar ways.
When triggering either bug I used strings with a UTF-8 encoding, which can be checked with the encoding method. The encoding can be controlled in several ways, such as using the encoding magic comment, calling .force_encoding("utf-8") on a string object or setting the locale. Since commit 8e49e25b59 in June 2020, the official Docker images have explicitly set the locale to a UTF-8 encoding.
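As a minimal sketch of checking and controlling the encoding, the following tags one of the trigger strings as UTF-8 explicitly so that the result does not depend on the locale or source encoding. It also shows that the string is not well-formed UTF-8, since a lone "\x80" byte is not a complete character.

```ruby
# Tag the trigger string as UTF-8 regardless of locale or source encoding.
s = ("A" * 39 + "\x80").force_encoding("utf-8")

puts s.encoding          # => UTF-8
puts s.valid_encoding?   # => false, "\x80" alone is not a valid UTF-8 character

# The same bytes tagged as binary are valid, one byte per character.
binary = s.dup.force_encoding(Encoding::BINARY)
puts binary.valid_encoding?  # => true
puts binary.size             # => 40
```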
Finally, the following Dockerfile can be used to compare branch coverage within string.c for two different Ruby one-liners.
FROM fedora:39
RUN dnf update -y && \
    dnf install -y '@Development tools' redhat-rpm-config autoconf \
    openssl-devel zlib-devel libyaml ruby libffi libyaml-devel \
    libffi-devel vim-enhanced && \
    dnf clean all && \
    git clone --branch v3_3_0 --depth 1 https://github.com/ruby/ruby && \
    cd /ruby && \
    ./autogen.sh && \
    ./configure --enable-gcov && \
    make "-j$(nproc)" && \
    cat <<"EOF" > /entrypoint.sh
cd /ruby
find . -name \*.gcda -exec rm {} \;
./ruby --disable-all -e "$1"
gcov --stdout -c string.c > /gcov-left
find . -name \*.gcda -exec rm {} \;
./ruby --disable-all -e "$2"
gcov --stdout -c string.c > /gcov-right
vimdiff -c 'windo set wrap' /gcov-left /gcov-right
EOF
ENTRYPOINT ["/bin/bash", "/entrypoint.sh"]
It can be used as follows after placing the above in a file named Dockerfile:
$ docker build -t ruby-branch-cov-vimdiff .
$ docker run -it ruby-branch-cov-vimdiff '"a"[2,1]' '"a"[-2,1]'