The implementation of FastWrite consists of three steps: (1) scanning the object to calculate the required memory, (2) allocating memory, and (3) completing serialization. This approach eliminates the need for slice appending, reducing memory allocation and copying, resulting in significantly improved performance.
However, if there is a concurrency bug in the user code between step 1 (calculating the required memory) and step 3 (serialization), where another goroutine concurrently writes to the object being serialized (request parameters for the client or response values for the server), it can lead to Kitex reading an inconsistent object, resulting in errors or even panics.
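The three steps and the failure mode can be illustrated with a minimal, standard-library-only sketch. It is not Kitex's implementation: the `message` type and the `size`, `write`, and `encode` helpers are hypothetical, and the `mutate` callback stands in deterministically for a write from another goroutine.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// message is a hypothetical stand-in for a generated Thrift struct.
type message struct {
	UserID string
}

// size is step 1: scan the object and compute the bytes it needs
// (a 4-byte length prefix plus the string body).
func size(m *message) int {
	return 4 + len(m.UserID)
}

// write is step 3: serialize into a buffer that was sized in advance.
// It never appends, so it relies on buf being large enough.
func write(buf []byte, m *message) {
	binary.BigEndian.PutUint32(buf, uint32(len(m.UserID)))
	copy(buf[4:4+len(m.UserID)], m.UserID) // panics if buf is too short
}

// encode runs the three steps. mutate simulates another goroutine
// changing the object between step 1 and step 3.
func encode(m *message, mutate func(*message)) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("panic during write: %v", r)
		}
	}()
	n := size(m)           // (1) calculate the required memory
	buf := make([]byte, n) // (2) allocate
	if mutate != nil {
		mutate(m)
	}
	write(buf, m) // (3) serialize
	return buf, nil
}

func main() {
	_, err := encode(&message{UserID: "123"}, nil)
	fmt.Println(err)
	_, err = encode(&message{UserID: "123"}, func(m *message) { m.UserID = "123456" })
	fmt.Println(err)
}
```

The second call fails because the buffer was sized for the 3-byte value but the write step sees the 6-byte value.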
Note: Although the panic is raised inside the FastWriteField method of Kitex, the root cause of the error lies in the user code, which is a typical concurrency issue.
The following error messages may occur:
runtime error: index out of range [3] with length 3
runtime error: slice bounds out of range [86:59]
runtime error: invalid memory address or nil pointer dereference
A typical scenario involves a field referenced within the request being a global variable (or a cached object) that can be concurrently written to.
The simplified error message is as follows:
KITEX: panic, ..., error=runtime error: invalid memory address or nil pointer dereference
panic(...):
github.com/cloudwego/kitex/pkg/protocol/bthrift.binaryProtocol.WriteBinaryNocopy(...)
git/to/project/kitex_gen/some_package.(*SomeType).fastWriteField2(...)
The panic stack includes fastWriteField2, which is a typical case of concurrent read and write in business logic. The simplified business code is as follows:
key := "default"  // reflect.StringHeader{Data=0xXXX, Len=7}
fallbackKey := "" // reflect.StringHeader{Data=nil, Len=0}
wg := sync.WaitGroup{}
for _, task := range taskList {
    wg.Add(1)
    go func() {
        defer wg.Done()
        if someCondition {
            key = fallbackKey // writes the shared `key` while another goroutine may be reading it
        }
        kitexClient.GetByKey(ctx, &Request{Key: key})
    }()
}
wg.Wait()
Analysis:
1. The key variable, of type string, may be read and written by different goroutines within the loop.
2. string is not thread-safe: in the Go runtime, a string is represented as reflect.StringHeader{Data uintptr, Len int}, whose two fields are assigned separately.
3. One goroutine may therefore observe a half-updated key (for example, Data = nil with Len = 7) while it is executing the fastWriteField2 method.
4. When fastWriteField2 attempts to read 7 characters from Data = nil, it triggers a nil pointer dereference panic.
Note:
1. Slices are not thread-safe either: they are represented as reflect.SliceHeader and have Data, Len, and Cap fields.
2. If fallbackKey = "123456789" in the code above, the serialized data may be read as “1234567”. If fallbackKey = "123", it could result in out-of-bounds data access or even panics.
A typical scenario is when a field referenced in the response is a global variable (or a cached object), and that object may be concurrently written to.
The error message for this case is as follows (slightly simplified):
KITEX: panic happened, ..., error=<Error: runtime error: index out of range [3] with length 1>
panic(...)
encoding/binary.bigEndian.PutUint32(...)
github.com/cloudwego/kitex/pkg/protocol/bthrift.binaryProtocol.WriteI32(...)
git/to/project/kitex_gen/some_package.(*SomeType).fastWriteField3(...)
From the panic stack, it can be seen that the fastWriteField3 method generated by Kitex triggers the panic. It is a typical case of concurrent read and write in business logic. The simplified business code can be as follows:
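var localCache sync.Map // shared by all requests

func (*Handler) GetByKey(ctx context.Context, req *xxx.Request) (*xxx.Response, error) {
    v, _ := localCache.Load(req.Key) // cache-miss handling omitted
    resp := v.(*xxx.Response)        // the same cached object is returned to every caller
    resp.UserID = req.UserID         // concurrent write to the shared object
    return resp, nil
}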
Analysis:
1. PutUint32(b []byte, v uint32) checks that b[3] is accessible before writing v to b. The panic occurs here, indicating an out-of-range access.
2. FastWrite has already traversed resp to determine the required length, so between “(1) calculating required memory” and “(3) serialization,” a field in resp must have been modified and now requires more memory space.
3. Consider two concurrent requests (both with Key as “X”):
A.Request: {Key="X", UserID = "123" }
B.Request: {Key="X", UserID = "123456"}
4. A possible execution order is:
a. FastWrite for request A calculates that it needs N+3 bytes (N represents the space required by other fields in resp) and allocates the memory.
b. Request B overwrites UserID with “123456” (note that the response is cached, which means A will also return this object).
c. Request A writes resp into the allocated space. When it tries to write the last field, UserID, it finds that there is not enough space available (it now needs N+6 bytes): index out of range [3] with length 1
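One way to avoid this particular race is to never write to the cached object and to return a per-request copy instead. The following is a minimal sketch, not the only possible fix: the `Response` type and the `getByKey` helper are hypothetical stand-ins for the generated type and the handler, and a shallow copy is assumed to be sufficient because only a string field is changed.

```go
package main

import (
	"fmt"
	"sync"
)

// Response is a hypothetical stand-in for the generated xxx.Response type.
type Response struct {
	Data   string
	UserID string
}

var localCache sync.Map // key -> *Response, shared by all requests

// getByKey returns a per-request copy, so the cached object is never
// written to after it has been stored.
func getByKey(key, userID string) (*Response, bool) {
	v, ok := localCache.Load(key)
	if !ok {
		return nil, false
	}
	cached := v.(*Response)
	resp := *cached // shallow copy; deep-copy slices, maps, or nested pointers if they are modified too
	resp.UserID = userID
	return &resp, true
}

func main() {
	localCache.Store("X", &Response{Data: "payload"})
	a, _ := getByKey("X", "123")
	b, _ := getByKey("X", "123456")
	fmt.Println(a.UserID, b.UserID)
}
```

Each request now serializes its own object, so the size computed in step 1 still holds in step 3.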
Note: If there is concurrent read and write on the Request or Response (including objects directly or indirectly referenced by them), fix this issue first.
If possible, use the -race flag to identify and eliminate concurrency issues in the code. Refer to https://go.dev/blog/race-detector for more details.
Note: Be cautious when using it in a production environment as it can significantly impact performance.
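For the client-side example earlier, one race-free shape is to give each goroutine its own local key instead of assigning to a shared variable. This is a minimal sketch under that assumption: `chooseKey` and `buildKeys` are hypothetical helpers, and the slice of results stands in for the actual RPC calls.

```go
package main

import (
	"fmt"
	"sync"
)

// chooseKey returns the key for one task without touching shared state.
func chooseKey(useFallback bool) string {
	const defaultKey, fallbackKey = "default", ""
	if useFallback {
		return fallbackKey
	}
	return defaultKey
}

// buildKeys computes one key per task concurrently. Each goroutine writes
// only its own slot of the result slice, so no string is shared mutably.
func buildKeys(useFallback []bool) []string {
	keys := make([]string, len(useFallback))
	var wg sync.WaitGroup
	for i, fb := range useFallback {
		wg.Add(1)
		go func(i int, fb bool) {
			defer wg.Done()
			key := chooseKey(fb) // local to this goroutine
			keys[i] = key        // real code would call kitexClient.GetByKey(ctx, &Request{Key: key})
		}(i, fb)
	}
	wg.Wait()
	return keys
}

func main() {
	fmt.Println(buildKeys([]bool{false, true, false}))
}
```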
Based on the panic stack, you can identify the specific field by looking at the field type and the fastWriteFieldN method.
For example, if the innermost method in the panic stack is kitex_gen/some_package.(*Base).fastWriteField3(...), it means that the error is caused by the field with index 3 in the Base type. In the following IDL, the field at index 3 is the Addr field of type string:
struct Base {
1: string LogID = "",
2: string Caller = "",
3: string Addr = "",
}
If the panic error message contains invalid memory address or nil pointer dereference, it indicates that the Addr field may have been concurrently overwritten with an empty value by another goroutine.
If the panic error message is different (e.g., index out of range [3] with length 3), it may indicate that another variable-length field is being concurrently read and written and now occupies more memory, leaving too little allocated space when this field is written. You can refer to the “Comparing Two Sampling Results” section below to locate the issue.
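When many panic logs need to be triaged, the mapping from a stack frame to a struct and field index can be extracted mechanically. The helper below is a hypothetical sketch that assumes frames follow the `(*Type).fastWriteFieldN` naming seen in the stacks above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// frameRE matches generated method names such as "(*Base).fastWriteField3".
var frameRE = regexp.MustCompile(`\(\*(\w+)\)\.fastWriteField(\d+)`)

// fieldFromFrame extracts the struct name and field index from one line of
// a panic stack. ok is false if the line is not a fastWriteFieldN frame.
func fieldFromFrame(line string) (typeName string, fieldID int, ok bool) {
	m := frameRE.FindStringSubmatch(line)
	if m == nil {
		return "", 0, false
	}
	id, err := strconv.Atoi(m[2])
	if err != nil {
		return "", 0, false
	}
	return m[1], id, true
}

func main() {
	name, id, ok := fieldFromFrame("git/to/project/kitex_gen/some_package.(*Base).fastWriteField3(...)")
	fmt.Println(name, id, ok)
}
```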
Client Side
The proposed solution for the client side is as follows:
First Sampling: Before sending the request, serialize the entire request body into JSON.
Second Sampling (two approaches for reference):
a. Panic Sampling: Check if a panic has occurred and then serialize the request body into JSON.
b. Delayed Sampling: Create a goroutine that sleeps for a certain amount of time and then serialize the request body into JSON.
Compare the results of the two samplings to identify different fields and investigate code sections where concurrent read/write operations may be happening.
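The comparison step can be sketched with the standard library alone. `changedFields` is a hypothetical helper that reports which top-level JSON fields differ between the two samples; the `request` type is a stand-in for the generated request struct, and the assignment in `main` simulates a concurrent write.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"sort"
)

// changedFields compares two JSON objects and returns the top-level keys
// whose values differ, including keys present on only one side.
func changedFields(before, after []byte) ([]string, error) {
	var b, a map[string]json.RawMessage
	if err := json.Unmarshal(before, &b); err != nil {
		return nil, err
	}
	if err := json.Unmarshal(after, &a); err != nil {
		return nil, err
	}
	var diff []string
	for k, bv := range b {
		if av, ok := a[k]; !ok || !bytes.Equal(bv, av) {
			diff = append(diff, k)
		}
	}
	for k := range a {
		if _, ok := b[k]; !ok {
			diff = append(diff, k)
		}
	}
	sort.Strings(diff)
	return diff, nil
}

// request is a hypothetical stand-in for the generated request type.
type request struct {
	Key    string
	UserID string
}

func main() {
	req := &request{Key: "X", UserID: "123"}
	before, _ := json.Marshal(req) // first sampling
	req.UserID = "123456"          // stands in for a concurrent write
	after, _ := json.Marshal(req)  // second sampling
	diff, err := changedFields(before, after)
	fmt.Println(diff, err)
}
```

The fields reported as changed are the ones to search for in code paths that may write concurrently.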
Server Side
On the server side, as the server has exited all middleware during encoding, it is not possible to capture panics or perform sampling comparisons using middleware.
The proposed solution for the server side is as follows:
import (
    "bytes"
    "context"
    "encoding/json"

    "github.com/cloudwego/kitex/pkg/klog"
    "github.com/cloudwego/kitex/pkg/remote"
    "github.com/cloudwego/kitex/pkg/remote/codec/thrift"
    "github.com/cloudwego/kitex/server"
)

// codecForPanic wraps the real payload codec and samples the message
// before encoding and again if encoding panics.
type codecForPanic struct {
    payloadCodec remote.PayloadCodec
}

func (c *codecForPanic) Marshal(ctx context.Context, message remote.Message, out remote.ByteBuffer) error {
    var before, after []byte
    var err error
    defer func() {
        if r := recover(); r != nil {
            klog.Errorf("panic: %v", r)
            after, _ = json.Marshal(message.Data()) // second sampling
            if !bytes.Equal(before, after) {
                klog.Errorf("before = %s, after = %s", before, after)
            }
        }
    }()
    before, err = json.Marshal(message.Data()) // first sampling; note the performance loss
    if err != nil {
        klog.Errorf("json encode before Marshal failed: err = %v", err)
    }
    return c.payloadCodec.Marshal(ctx, message, out)
}

func (c *codecForPanic) Unmarshal(ctx context.Context, message remote.Message, in remote.ByteBuffer) error {
    // Recover and compare here
    return c.payloadCodec.Unmarshal(ctx, message, in)
}

func (c *codecForPanic) Name() string {
    return "codecForPanic"
}

// Register the wrapper when creating the server.
svr := test.NewServer(new(TestServiceImpl),
    server.WithPayloadCodec(&codecForPanic{
        payloadCodec: thrift.NewThriftCodecWithConfig(thrift.FastRead | thrift.FastWrite),
    }),
    // Other options
)
If panics occur infrequently, you can also consider starting a new goroutine, sleeping for a period of time, and then performing the comparison. This approach makes it easier to identify the modified sections.
runtime error: slice bounds out of range [1600217702:1678338]
The received message (Client: response, Server: request) is encoded incorrectly.
Refer to the troubleshooting suggestions for FastWrite and investigate the corresponding endpoint (if the error occurs on the server side, check the client; vice versa).
Disabling FastCodec is not recommended: race conditions in the business logic can lead to inconsistent business data and unpredictable consequences.
If there is a genuine need, you can choose one of the following options:
1. Disable FastCodec in Thrift using options:
// Server side
svr := xxxService.NewServer(handler, server.WithPayloadCodec(
    thrift.NewThriftCodecDisableFastMode(true, false)))
// Client side
cli, err := xxxService.NewClient("", client.WithPayloadCodec(
    thrift.NewThriftCodecDisableFastMode(true, false)))
2. Do not generate FastCodec encoding/decoding code: use the -no-fast-api parameter of the Kitex command-line tool to regenerate the code.