1862 Commits

Author SHA1 Message Date
Étienne Barrié
12be40ae6b Implement chilled strings
[Feature #20205]

As a path toward enabling frozen string literals by default in the future,
this commit introduce "chilled strings". From a user perspective chilled
strings pretend to be frozen, but on the first attempt to mutate them,
they lose their frozen status and emit a warning rather than to raise a
`FrozenError`.

Implementation wise, `rb_compile_option_struct.frozen_string_literal` is
no longer a boolean but a tri-state of `enabled/disabled/unset`.

When code is compiled with frozen string literals neither explictly enabled
or disabled, string literals are compiled with a new `putchilledstring`
instruction. This instruction is identical to `putstring` except it marks
the String with the `STR_CHILLED (FL_USER3)` and `FL_FREEZE` flags.

Chilled strings have the `FL_FREEZE` flag as to minimize the need to check
for chilled strings across the codebase, and to improve compatibility with
C extensions.

Notes:
  - `String#freeze`: clears the chilled flag.
  - `String#-@`: acts as if the string was mutable.
  - `String#+@`: acts as if the string was mutable.
  - `String#clone`: copies the chilled flag.

Co-authored-by: Jean Boussier <byroot@ruby-lang.org>
2024-03-19 09:26:49 +01:00
Thomas Marshall
7e4b1f8e19
[Bug #20322] Fix rb_enc_interned_str_cstr null encoding
The documentation for `rb_enc_interned_str_cstr` notes that `enc` can be
a null pointer, but this currently causes a segmentation fault when
trying to autoload the encoding. This commit fixes the issue by checking
for NULL before calling `rb_enc_autoload`.
2024-03-03 10:43:35 +00:00
Peter Zhu
ce8531fed4 Stop using rb_str_locktmp_ensure publicly
rb_str_locktmp_ensure is a private API.
2024-02-23 14:08:29 -05:00
Takashi Kokubun
8a6740c70e
YJIT: Lazily push a frame for specialized C funcs (#10080)
* YJIT: Lazily push a frame for specialized C funcs

Co-authored-by: Maxime Chevalier-Boisvert <maxime.chevalierboisvert@shopify.com>

* Fix a comment on pc_to_cfunc

* Rename rb_yjit_check_pc to rb_yjit_lazy_push_frame

* Rename it to jit_prepare_lazy_frame_call

* Fix a typo

* Optimize String#getbyte as well

* Optimize String#byteslice as well

---------

Co-authored-by: Maxime Chevalier-Boisvert <maxime.chevalierboisvert@shopify.com>
2024-02-23 19:08:09 +00:00
Peter Zhu
510404f2de Stop using rb_fstring publicly
rb_fstring is a private API, so we should use rb_str_to_interned_str
instead, which is a public API.
2024-02-23 13:33:46 -05:00
Peter Zhu
df5b8ea4db Remove unneeded RUBY_FUNC_EXPORTED 2024-02-23 10:24:21 -05:00
Takashi Kokubun
d5080f6e8b Fix -Wsign-compare on String#initialize
../string.c:1886:57: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘long int’ [-Wsign-compare]
 1886 |                 if (STR_EMBED_P(str)) RUBY_ASSERT(osize <= str_embed_capa(str));
      |                                                         ^~
2024-02-22 16:11:30 -08:00
Nobuyoshi Nakada
e04146129e
[Bug #20292] Truncate embedded string to new capacity 2024-02-22 22:46:18 +09:00
Nobuyoshi Nakada
b1d70e4264
[Bug #20280] Check by rb_parser_enc_str_coderange
Co-authored-by: Yuichiro Kaneko <spiketeika@gmail.com>
2024-02-19 16:33:26 +09:00
Nobuyoshi Nakada
fcc55dc226
[Bug #20280] Raise SyntaxError on invalid encoding symbol 2024-02-19 16:33:26 +09:00
Peter Zhu
4d1b3a2bf3 Unset STR_SHARED when setting string to embed 2024-02-15 12:19:45 -05:00
Yusuke Endoh
25d74b9527 Do not include a backtick in error messages and backtraces
[Feature #16495]
2024-02-15 18:42:31 +09:00
Burdette Lamar
65f5435540
[DOC] Doc compliance (#9955) 2024-02-14 10:47:42 -05:00
Alan Wu
6261d4b4d8 Fix use-after-move in Symbol#inspect
The allocation could re-embed `orig_str` and invalidate the data
pointer from RSTRING_GETMEM() if the string is embedded.

Found on CI, where the test introduced in 7002e776944 ("Fix
Symbol#inspect for GC compaction") recently failed.

See: <https://github.com/ruby/ruby/actions/runs/7880657560/job/21503019659>
2024-02-13 14:49:54 -05:00
Aaron Patterson
c35fea8509
Specialize String#byteslice(a, b) (#9939)
* Specialize String#byteslice(a, b)

This adds a specialization for String#byteslice when there are two
parameters.

This makes our protobuf parser go from 5.84x slower to 5.33x slower

```
Comparison:
decode upstream (53738 bytes):     7228.5 i/s
decode protobuff (53738 bytes):     1236.8 i/s - 5.84x  slower

Comparison:
decode upstream (53738 bytes):     7024.8 i/s
decode protobuff (53738 bytes):     1318.5 i/s - 5.33x  slower
```

* Update yjit/src/codegen.rs

---------

Co-authored-by: Maxime Chevalier-Boisvert <maximechevalierb@gmail.com>
2024-02-13 16:20:27 +00:00
Peter Zhu
ac38f259aa Replace assert with RUBY_ASSERT in string.c
assert does not print the bug report, only the file and line number of
the assertion that failed. RUBY_ASSERT prints the full bug report, which
makes it much easier to debug.
2024-02-12 15:07:47 -05:00
Peter Zhu
c6b391214c [DOC] Improve flags of string 2024-02-08 10:49:38 -05:00
Peter Zhu
5e0c171451 Make io_fwrite safe for compaction
[Bug #20169]

Embedded strings are not safe for system calls without the GVL because
compaction can cause pages to be locked causing the operation to fail
with EFAULT. This commit changes io_fwrite to use rb_str_tmp_frozen_no_embed_acquire,
which guarantees that the return string is not embedded.
2024-02-05 11:11:07 -05:00
Takashi Kokubun
51753ec7fa
Annotate Symbol#to_s as leaf (#9769) 2024-01-31 10:47:35 -05:00
Peter Zhu
e17c83e02c Fix memory leak in String#tr and String#tr_s
rb_enc_codepoint_len could raise, which would cause the memory in buf
to leak.

For example:

    str1 = "\xE0\xA0\xA1#{" " * 100}".force_encoding("EUC-JP")
    str2 = ""
    str3 = "a".force_encoding("Windows-31J")

    10.times do
      1_000_000.times do
        str1.tr_s(str2, str3)
      rescue
      end

      puts `ps -o rss= -p #{$$}`
    end

Before:

    17536
    22752
    28032
    33312
    38688
    43968
    49200
    54432
    59744
    64992

After:

    12176
    12352
    12352
    12448
    12448
    12448
    12448
    12448
    12448
    12448
2024-01-17 08:54:25 -05:00
tompng
ade56737e2 Fix coderange of invalid_encoding_string.<<(ord)
Appending valid encoding character can change coderange from invalid to valid.
Example: "\x95".force_encoding('sjis')<<0x5C will be a valid string "\x{955C}"
2024-01-16 23:18:55 +09:00
Peter Zhu
b3d6128049 Fix memory leak in grapheme clusters
[Bug #20150]

String#grapheme_cluters and String#each_grapheme_cluster leaks memory
because if the string is not UTF-8, then the created regex will not
be freed.

For example:

    str = "hello world".encode(Encoding::UTF_32LE)

    10.times do
      1_000.times do
        str.grapheme_clusters
      end

      puts `ps -o rss= -p #{$$}`
    end

Before:

    26000
    42256
    59008
    75792
    92528
    109232
    125936
    142672
    159392
    176160

After:

    9264
    9504
    9808
    10000
    10128
    10224
    10352
    10544
    10704
    10896
2024-01-08 09:14:04 -05:00
Peter Zhu
5aba5f0454 [DOC] Add parentheses in call-seq for String#include? 2024-01-02 19:19:12 -05:00
Peter Zhu
7002e77694 Fix Symbol#inspect for GC compaction
The test fails when RGENGC_CHECK_MODE is turned on:

    1) Failure:
    TestSymbol#test_inspect_under_gc_compact_stress [test/ruby/test_symbol.rb:123]:
    <":testing"> expected but was
    <":\x00\x00\x00\x00\x00\x00\x00">.
2023-12-24 21:29:40 -05:00
Peter Zhu
50bf437341 Fix String#sub for GC compaction
The test fails when RGENGC_CHECK_MODE is turned on:

    TestString#test_sub_gc_compact_stress = 9.42 s
    1) Failure:
    TestString#test_sub_gc_compact_stress [test/ruby/test_string.rb:2089]:
    <"aaa [amp] yyy"> expected but was
    <"aaa [] yyy">.
2023-12-23 18:00:27 -05:00
Nobuyoshi Nakada
ab7f54688b
Stir the hash value more with encoding index 2023-12-17 00:30:00 +09:00
Nobuyoshi Nakada
b710f96b5a
[Bug #20068] Encoding does not matter to empty strings 2023-12-16 16:00:12 +09:00
Jeremy Evans
0d53dba7ce Make String#chomp! raise ArgumentError for 2+ arguments if string is empty
String#chomp! returned nil without checking the number of passed
arguments in this case.
2023-12-13 07:05:21 -08:00
Peter Zhu
ee0eca191f Make String#undump compaction safe 2023-12-01 15:04:31 -05:00
Peter Zhu
80ea7fbad8 Pin embedded shared strings
Embedded shared strings cannot be moved because strings point into the
slot of the shared string. There may be code using the RSTRING_PTR on
the stack, which would pin the string but not pin the shared string,
causing it to move.
2023-12-01 15:04:31 -05:00
Peter Zhu
3d908a41ab Guard match from GC in String#gsub
We need to guard match from GC because otherwise it could end up being
reclaimed or moved in compaction.
2023-11-29 19:21:40 -05:00
Peter Zhu
94015e0dce Guard match from GC when scanning string
We need to guard match from GC because otherwise it could end up being
reclaimed or moved in compaction.
2023-11-27 16:49:52 -05:00
Jean Boussier
83c385719d Specialize String#dup
`String#+@` is 2-3 times faster than `String#dup` because it can
directly go through `rb_str_dup` instead of using the generic
much slower `rb_obj_dup`.

This fact led to the existance of the ugly `Performance/UnfreezeString`
rubocop performance rule that encourage users to rewrite the much
more readable and convenient `"foo".dup` into the ugly `(+"foo")`.

Let's make that rubocop rule useless.

```
compare-ruby: ruby 3.3.0dev (2023-11-20T02:02:55Z master 701b0650de) [arm64-darwin22]
last_commit=[ruby/prism] feat: add encoding for IBM865 (https://github.com/ruby/prism/pull/1884)
built-ruby: ruby 3.3.0dev (2023-11-20T12:51:45Z faster-str-lit-dup 6b745bbc5d) [arm64-darwin22]
warming up..

|       |compare-ruby|built-ruby|
|:------|-----------:|---------:|
|uplus  |     16.312M|   16.332M|
|       |           -|     1.00x|
|dup    |      5.912M|   16.329M|
|       |           -|     2.76x|
```
2023-11-20 14:33:20 +01:00
Jean Boussier
ea1b1ea1aa String#force_encoding don't clear coderange if encoding is unchanged
Some code out there blind calls `force_encoding` without checking
what the original encoding was, which clears the coderange uselessly.

If the String is big, it can be a rather costly mistake.

For instance the `rack-utf8_sanitizer` gem does this on request
bodies.
2023-11-09 12:38:10 +01:00
Nobuyoshi Nakada
1910bd4247
String for string literal is not resizable 2023-11-08 00:59:45 +09:00
Jean Boussier
ac8ec004e5 Make String.new size pools aware.
If the required capacity would fit in an embded string,
returns one.

This can reduce malloc churn for code that use string buffers.
2023-11-02 23:34:58 +01:00
Nobuyoshi Nakada
50520cc193
[DOC] Missing comment markers 2023-09-27 16:18:05 +09:00
Nobuyoshi Nakada
6b66b5fded [Bug #19902] Update the coderange regarding the changed region 2023-09-26 15:35:40 +09:00
John Hawthorn
d89b15cdce Use end of char boundary in start_with?
Previously we used the next character following the found prefix to
determine if the match ended on a broken character.

This had caused surprising behaviour when a valid character was followed
by a UTF-8 continuation byte.

This commit changes the behaviour to instead look for the end of the
last character in the prefix.

[Bug #19784]

Co-authored-by: ywenc <ywenc@github.com>
Co-authored-by: Nobuyoshi Nakada <nobu@ruby-lang.org>
2023-09-01 16:23:28 -07:00
Nobuyoshi Nakada
b054c2fe06 [Bug #19784] Fix behaviors against prefix with broken encoding
- String#start_with?
- String#delete_prefix
- String#delete_prefix!
2023-08-26 08:58:02 +09:00
Nobuyoshi Nakada
00ac3a64ba Introduce at_char_boundary function 2023-08-26 08:58:02 +09:00
Alan Wu
2214bcb70d Fix premature string collection during append
Previously, the following crashed due to use-after-free
with AArch64 Alpine Linux 3.18.3 (aarch64-linux-musl):

```ruby
str = 'a' * (32*1024*1024)
p({z: str})
```

32 MiB is the default for `GC_MALLOC_LIMIT_MAX`, and the crash
could be dodged by setting `RUBY_GC_MALLOC_LIMIT_MAX` to large values.
Under a debugger, one can see the `str2` of rb_str_buf_append()
getting prematurely collected while str_buf_cat4() allocates capacity.

Add GC guards so the buffer of `str2` lives across the GC run
initiated in str_buf_cat4().

[Bug #19792]
2023-08-23 18:07:49 -04:00
Peter Zhu
837c12b0c8 Use STR_EMBED_P instead of testing STR_NOEMBED 2023-08-22 16:31:36 -04:00
Peter Zhu
724223b4ca Don't check for STR_NOEMBED in rb_fstring
We don't need to check for STR_NOEMBED because the check above for
STR_EMBED_P means that it can never be false.
2023-08-18 09:24:45 -04:00
Burdette Lamar
0e162457d6
[DOC] Don't suppress autolinks (#8208) 2023-08-11 19:22:21 -04:00
Kunshan Wang
132f097149 No computing embed_capa_max in str_subseq
Fix str_subseq so that it does not attempt to predict the size of the
object returned by str_alloc_heap.
2023-08-03 14:52:44 -04:00
Nobuyoshi Nakada
af04e26924
Fill terminator properly 2023-07-28 22:17:53 +09:00
alexandre184
e5825de7c9
[Bug #19769] Fix range of size 1 in String#tr 2023-07-15 16:36:53 +09:00
Nobuyoshi Nakada
9dcdffb8bf
Make the string index functions closer to symmetric
So that irregular parts may be more noticeable.
2023-07-09 18:45:51 +09:00
Nobuyoshi Nakada
5e79d5a560
Make rb_str_rindex return byte index
Leave callers to convert byte index to char index, as well as
`rb_str_index`, so that `rb_str_rpartition` does not need to
re-convert char index to byte index.
2023-07-09 16:39:28 +09:00