pickle protocol 5: writable in-band PickleBuffer with empty tobytes() can memoize b'' via BYTEARRAY8, so later bytes references unpickle as bytearray #148914

@madkrouger

Bug report

Bug description:

The problem

  • for an in-band, writable PickleBuffer, the pickler does buf = m.tobytes(). That value is always bytes, even though the underlying buffer is writable
  • if the buffer is empty, buf is always the b'' singleton, so id(buf) is the same for every such call
  • the pickler then does in_memo = id(buf) in self.memo
  • if in_memo is false, the pickler calls save_bytearray(buf), which writes BYTEARRAY8 and then memoizes buf, i.e. the shared b'' singleton
  • once b'' has been memoized anywhere earlier in the same dump, every later empty in-band tobytes() in that dump sees in_memo as true
  • so any later part of the same pickle that is supposed to store b'' as bytes may instead reuse the memo entry written via BYTEARRAY8. The unpickler then returns the bytearray, not bytes.
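The chain above can be sketched with the standard library alone. This is a minimal sketch, not a definitive test: it assumes CPython (where empty bytes objects are one shared object) and uses the private pure-Python `pickle._Pickler` class so that the code path quoted in this report is the one exercised. `roundtrip_types` is a hypothetical helper name. The script prints the restored types rather than asserting a fixed result, since the outcome depends on whether the interpreter is affected.

```python
import io
import pickle
from pickle import PickleBuffer


def roundtrip_types(obj):
    """Pickle obj with the pure-Python pickler and return the restored element types."""
    out = io.BytesIO()
    # pickle._Pickler is the private pure-Python implementation (an assumption
    # of this sketch: it is not a documented API).
    pickle._Pickler(out, protocol=5).dump(obj)
    restored = pickle.loads(out.getvalue())
    return [type(item) for item in restored]


# Empty tobytes() results share one object on CPython (implementation detail).
empty_a = memoryview(bytearray()).tobytes()
empty_b = memoryview(bytearray()).tobytes()
print("same object:", empty_a is empty_b)

# A writable, empty, in-band PickleBuffer followed by a plain b''.
types = roundtrip_types((PickleBuffer(bytearray()), b""))
print("restored types:", types)
```

If the second printed type is `bytearray` rather than `bytes`, the plain `b''` was resolved through the memo entry created for the buffer.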

Code that requires bytes (strict isinstance etc.) can fail during unpickling when it receives a bytearray.

Excerpt from the pure-Python pickler (pickle.py), but the same applies to the C implementation:

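# Writable in-band branch; buf = m.tobytes() and in_memo = id(buf) in self.memo
# (the pb_branch assignments appear to be debugging markers added for this report)
if in_memo:
    pb_branch = "_save_bytearray_no_memo"
    self._save_bytearray_no_memo(buf)
else:
    pb_branch = "save_bytearray"
    # save_bytearray() writes BYTEARRAY8 and then memoizes buf
    self.save_bytearray(buf)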

A minimal repro, using the third-party dill package:

import dill
from pickle import PickleBuffer

# A somewhat artificial example that triggers the wrong code path
def repro_minimal() -> None:
    pb = PickleBuffer(memoryview(bytearray()))

    def f():
        pass

    blob = dill.dumps((pb, f), protocol=5)
    # Fails: dill's _create_code receives a bytearray where code() requires bytes
    dill.loads(blob)
    print("dill.loads ok")

repro_minimal()

A less synthetic repro, if you have pandas and PyArrow installed:

import dill
import pandas as pd

def repro_arrow_empty_dataframe() -> None:
    col = "EMPTY_STRING_COLUMN"
    df = pd.DataFrame({col: pd.Series([""], dtype="string[pyarrow]")})

    def g():
        pass

    blob = dill.dumps((df, g), protocol=5)
    dill.loads(blob)
    print("dill.loads ok")

Both result in TypeError: code() argument 16 must be bytes, not bytearray

However, that code() case is only one example; the underlying issue is bytes vs bytearray memo reuse, and similar failures can appear anywhere bytes are required.
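Since the visible failure depends on what the consumer does with the value, it may help to look at the opcode stream itself. The following is a rough diagnostic sketch, assuming the private pure-Python `pickle._Pickler` class and a hypothetical helper named `opcode_names`; it lists the opcodes via `pickletools.genops` and reports whether any explicit bytes opcode was written for the trailing `b''`.

```python
import io
import pickle
import pickletools
from pickle import PickleBuffer


def opcode_names(payload: bytes) -> list[str]:
    """Return the opcode names in a pickle stream, in order."""
    return [opcode.name for opcode, _arg, _pos in pickletools.genops(payload)]


out = io.BytesIO()
# pickle._Pickler is the private pure-Python pickler (assumption of this sketch).
pickle._Pickler(out, protocol=5).dump((PickleBuffer(bytearray()), b""))
names = opcode_names(out.getvalue())
print(names)

# A memo fetch after the BYTEARRAY8 entry, with no bytes opcode for the second
# element, would indicate the reuse described in this report.
bytes_opcodes = {"SHORT_BINBYTES", "BINBYTES", "BINBYTES8"}
print("explicit bytes opcode present:", any(n in bytes_opcodes for n in names))
```

The same helper can be pointed at any protocol 5 payload to see whether a `bytes` value was written out explicitly or only as a memo reference.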

CPython versions tested on:

3.13

Operating systems tested on:

macOS

Labels: stdlib, type-bug
