Context cancellation ignored causing heavy CPU usage #370

@jault3

Bug Description

I believe I found an issue with how the cloud-sql-go-connector instance refresh functionality handles errors, or more specifically, cancelled context errors.

I have a process that runs as a long-running web server and uses the cloud-sql-go-connector to connect to postgres Cloud SQL instances. I noticed extremely high CPU usage after running a few tests and then deleting the Cloud SQL instance that it was connecting to. After doing some profiling and looking through the cloud sql connector code, I think I have a decent understanding of the issue (please correct anything that is wrong).

The fetchMetadata function attempts to Get the instance details using the sqladmin library, which, as expected, returns an error if the instance has been deleted. Tracing that back a few more functions, I believe it is called from scheduleRefresh. It appears that if the refresh returns any error, it immediately schedules another one. Since the instance is deleted, it enters an endless refresh loop while attempting to hold on to the last known good connection. Should this be able to handle unrecoverable errors (like deleted instances) and return more gracefully? At first I thought this was the primary issue, but it raised a larger question: why was the connector still refreshing in the background at all, since I had closed the dialer and connection (see example code below)?

The dialer appeared to be closed properly; the close propagated to the instance close, which cancels the internal context.

However, the refresh returns an error if the context was cancelled, and the calling code treats all errors the same way (by scheduling another refresh).

I believe this needs to check whether the returned error is context.Canceled and, if it is, just return without scheduling another refresh. If this is accurate, I would love to submit a PR!

Example code (or command)

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"net/http"
	"time"

	"cloud.google.com/go/cloudsqlconn"
	"github.com/jackc/pgx/v4/pgxpool"
)

const (
	username = "postgres"
	password = "TODO"
	project  = "TODO"
	region   = "TODO" // e.g. us-central1
	instance = "TODO"
)

func main() {
	handler := func(w http.ResponseWriter, req *http.Request) {
		msg := "success"
		if err := selectOne(); err != nil {
			msg = err.Error()
		}
		w.Write([]byte(msg))
	}

	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func selectOne() error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute*2)
	defer cancel()

	dsn := fmt.Sprintf("user=%s password=%s dbname=postgres sslmode=disable", username, password)
	config, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		return fmt.Errorf("failed to parse pgx config: %v", err)
	}

	dialer, err := cloudsqlconn.NewDialer(ctx)
	if err != nil {
		return fmt.Errorf("failed to initialize dialer: %v", err)
	}
	defer dialer.Close()

	connectionName := fmt.Sprintf("%s:%s:%s", project, region, instance)

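	// Route all pgx connections through the Cloud SQL dialer. The network and
	// address arguments are unused because the dialer takes the connection name.
	config.ConnConfig.DialFunc = func(ctx context.Context, _, _ string) (net.Conn, error) {
		return dialer.Dial(ctx, connectionName)
	}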

	conn, err := pgxpool.ConnectConfig(context.Background(), config)
	if err != nil {
		return fmt.Errorf("failed to connect: %w", err)
	}
	defer conn.Close()

	if _, err = conn.Exec(ctx, "select 1"); err != nil {
		return err
	}

	return nil
}

Stacktrace

n/a

How to reproduce

  1. Create a Cloud SQL postgres instance in the GCP console
  2. Run the above example code with go run main.go (replacing the const values appropriately)
  3. Run curl -XPOST localhost:8080 to trigger the db connection
  4. Delete the Cloud SQL instance in the GCP console
  5. Wait until the ephemeral cert expires (which I believe takes 1 hour) and monitor the CPU usage of the program

Environment

  1. OS type and version: macOS 13.0 (this same issue occurs in a linux container deployed to a GKE cluster)
  2. Go version: go version go1.19.2 darwin/amd64
  3. Connector version: cloud.google.com/go/cloudsqlconn v1.0.1

Labels

priority: p1 (Important issue which blocks shipping the next release. Will be fixed prior to next release.)
type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)
