Bug Description
I believe I have found an issue with how the cloud-sql-go-connector's instance refresh functionality handles errors, specifically cancelled-context errors.
I have a process that runs as a long-running web server and uses the cloud-sql-go-connector to connect to Postgres Cloud SQL instances. I noticed extremely high CPU usage after running a few tests and then deleting the Cloud SQL instance the server was connecting to. After doing some profiling and looking through the Cloud SQL connector code, I think I have a decent understanding of the issue (please correct anything that is wrong).
The fetchMetadata function attempts to Get the instance details using the sqladmin library here. As expected, if the instance has been deleted, that call returns an error here. Tracing that back a few more functions, I believe it is called from scheduleRefresh in this section of code. If the refresh returns any error, it immediately schedules another one; since the instance is deleted, the connector enters an endless refresh loop and attempts to hold on to the last known good connection. Should this handle unrecoverable errors (like deleted instances) and return more gracefully? At first I thought this was the primary issue, but it raised a larger question for me: why was the connector still refreshing in the background at all, given that I had closed both the dialer and the connection (see example code below)?
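For illustration, here is a minimal sketch of the failure mode as I understand it (the names here are hypothetical stand-ins, not the connector's actual internals): any refresh error immediately triggers another refresh, with no terminal condition, so an instance that can never refresh successfully spins forever.

```go
package main

import (
	"errors"
	"fmt"
)

// refresh is a hypothetical stand-in for the connector's certificate refresh.
// Here it always fails, as it would once the instance has been deleted.
func refresh() error {
	return errors.New("instance does not exist")
}

// refreshLoop mimics the pattern described above: any error immediately
// schedules another refresh, with no terminal condition. The attempt cap
// exists only so this demo halts; the real loop has no such cap.
func refreshLoop(maxAttempts int) int {
	attempts := 0
	for attempts < maxAttempts {
		attempts++
		if err := refresh(); err != nil {
			continue // reschedule immediately: this is the busy loop
		}
		return attempts // success would break out, but never happens here
	}
	return attempts
}

func main() {
	fmt.Println(refreshLoop(1000)) // spins until the demo cap is hit
}
```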
The dialer looked to be cancelled properly here, which propagated to the instance close and cancelled the internal context.
However, this line returns an error if the context was cancelled, and this code treats all errors the same (by scheduling another refresh).
I believe this needs to check whether the returned error is context.Canceled and, if it is, just return. If this is accurate, I would love to submit a PR!
Example code (or command)
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"net/http"
	"time"

	"cloud.google.com/go/cloudsqlconn"
	"github.com/jackc/pgx/v4/pgxpool"
)

const (
	username = "postgres"
	password = "TODO"
	project  = "TODO"
	region   = "TODO" // e.g. us-central1
	instance = "TODO"
)

func main() {
	handler := func(w http.ResponseWriter, req *http.Request) {
		msg := "success"
		if err := selectOne(); err != nil {
			msg = err.Error()
		}
		w.Write([]byte(msg))
	}
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func selectOne() error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute*2)
	defer cancel()
	dsn := fmt.Sprintf("user=%s password=%s dbname=postgres sslmode=disable", username, password)
	config, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return fmt.Errorf("failed to parse pgx config: %v", err)
	}
	dialer, err := cloudsqlconn.NewDialer(ctx)
	if err != nil {
		return fmt.Errorf("failed to initialize dialer: %v", err)
	}
	defer dialer.Close()
	connectionName := fmt.Sprintf("%s:%s:%s", project, region, instance)
	config.ConnConfig.DialFunc = func(ctx context.Context, _ string, instance string) (net.Conn, error) {
		return dialer.Dial(ctx, connectionName)
	}
	conn, err := pgxpool.ConnectConfig(context.Background(), config)
	if err != nil {
		return fmt.Errorf("failed to connect: %w", err)
	}
	defer conn.Close()
	if _, err = conn.Exec(ctx, "select 1"); err != nil {
		return err
	}
	return nil
}
Stacktrace
n/a
How to reproduce
- Create a Cloud SQL postgres instance in the GCP console
- Run the above example code: go run main.go (replacing the const values appropriately)
- Run curl -XPOST localhost:8080 to trigger the db connection
- Delete the Cloud SQL instance in the GCP console
- Wait some amount of time (until the ephemeral cert expires, which I believe is 1 hour) and monitor CPU usage of the program
Environment
- OS type and version: macOS 13.0 (this same issue occurs in a linux container deployed to a GKE cluster)
- Go version: go version go1.19.2 darwin/amd64
- Connector version: cloud.google.com/go/cloudsqlconn v1.0.1