test: events sent during remote join handshake should not be lost #847

FrenchGithubUser wants to merge 2 commits into matrix-org:main
Conversation
Force-pushed from 3b8e96d to e5b41dc
Pull request overview
Adds a regression test to ensure events created during the federation join handshake (between /make_join and /send_join) remain discoverable to the joining server, preventing missed events / unreachable forward extremities during joins.
Changes:
- Added `TestEventBetweenMakeJoinAndSendJoinIsNotLost` to assert intervening events become discoverable (directly or via `prev_events` references).
- Implemented a custom federation `/_matrix/federation/v1/send/{transactionID}` handler to observe outbound PDUs before the Complement server is in the room.
- Added `io` import to read raw transaction bodies.
```go
// We track the message event ID sent between make_join and send_join.
// After send_join, we wait for hs1 to send us either:
// - the message event itself, or
// - any event whose prev_events reference the message (e.g. a dummy event)
var messageEventID string
messageDiscoverableWaiter := helpers.NewWaiter()

srv := federation.NewServer(t, deployment,
	federation.HandleKeyRequests(),
)
srv.UnexpectedRequestsAreErrors = false

// Custom /send handler: the Complement server won't be in the room until
// send_join completes, so we can't use HandleTransactionRequests (which
// requires the room in srv.rooms). Instead we parse the raw transaction.
srv.Mux().Handle("/_matrix/federation/v1/send/{transactionID}", http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
	body, _ := io.ReadAll(req.Body)
	txn := gjson.ParseBytes(body)
	txn.Get("pdus").ForEach(func(_, pdu gjson.Result) bool {
		eventID := pdu.Get("event_id").String()
		eventType := pdu.Get("type").String()
		t.Logf("Received PDU via /send: type=%s id=%s", eventType, eventID)

		if messageEventID == "" {
			return true
		}
```
I'm not a Go expert so I don't know for sure whether this is relevant, but it sounds plausible.

Based on https://tip.golang.org/ref/mem, without synchronization, writes are not guaranteed to be observed by other goroutines:

> the assignment to `a` is not followed by any synchronization event, so it is not guaranteed to be observed by any other goroutine. In fact, an aggressive compiler might delete the entire `go` statement.

> If the effects of a goroutine must be observed by another goroutine, use a synchronization mechanism such as a lock or channel communication to establish a relative ordering.
```go
// Check if this event's prev_events reference the message
// (e.g. a dummy event tying the forward extremities together).
pdu.Get("prev_events").ForEach(func(_, prevEvent gjson.Result) bool {
	if prevEvent.String() == messageEventID {
		messageDiscoverableWaiter.Finish()
		return false
	}
	return true
})
```
Comment why this is good enough.

I assume it's because people can backfill to get the message.

But then this brings up the question: what if it's a prev_event further back in a chain? (We currently don't detect that as good enough.) We need to at least explain this edge case, although perhaps we should /backfill up to the beginning of the room and ensure the messageEventID was in there.
It looks like this is partially addressed below but we should also explain it here:

```go
// Step 4: hs1 should make the missed message discoverable to the joining
// server. We accept either receiving the message event directly, or
// receiving any event whose prev_events reference it (allowing the
// joining server to backfill).
```
This was updated to the following:

```go
// Check if this event's prev_events directly reference the message
// (e.g. a dummy event tying the two forward extremities together).
// If so, the joining server can backfill from that event and will
// discover the message. We only check one level of prev_events:
// if the reference is deeper in the DAG the joining server can
// still reach the message through backfill.
```

This sounds pretty good now but I think it needs a slight adjustment to set the right expectations:

```go
// Check if this event's prev_events directly reference the message (e.g. a dummy
// event tying the two forward extremities together). If so, the joining server
// can backfill from that event and will discover the message.
//
// XXX: We only check one level of prev_events: if the reference is deeper in the
// DAG, it's valid and the joining server can still reach the message through
// backfill but our checks don't account for that yet (feel free to edit this
// assertion if you run into this)
```
Force-pushed from 95d0c03 to 0b5cea2
MadLittleMods left a comment

Thanks for the updates so far @FrenchGithubUser! This is looking close.
```go
if err != nil {
	t.Fatalf("failed to read request body in /send handler: %v", err)
	return
}
```
We can use `must.NotError(...)` here.
Test for element-hq/synapse#19390
Disclaimer

As I'm not super fluent in Go, I used a lot of AI to write this test. However, I did review it as best I could together with someone else from Famedly.
Pull Request Checklist
Signed-off-by: Thomas Traineau <t.traineau@famedly.com>