Profiling a running ACS instance

ACS isn’t devoid of bugs, nor is it the most efficient process. That isn’t to say it has to be 100% efficient to be successful, but it would be nice if it used only the resources it absolutely needs and little else. Go provides some useful introspective tooling for profiling a running process.

HTTP endpoints

ACS has been configured to allow completely open access to the read-only routes rooted at /debug. There are, as of this writing, really only two routes:

debug/stats

The debug/stats route emits some potentially useful data about HTTP traffic. You should see something like the following:

 {
 	"pid":17181,
 	"uptime":"18m37.244701274s",
 	"uptime_sec":1117.244701274,
 	"time":"2018-06-25 09:03:20.60792218 -0600 MDT m=+1121.465433394",
 	"unixtime":1529939000,
 	"status_code_count":{
 		"401":1
 	},
 	"total_status_code_count":{
 		"200":228,
 		"401":1118,
 		"404":6
 	},
 	"count":1,
 	"total_count":1352,
 	"total_response_time":"2m0.726762259s",
 	"total_response_time_sec":120.726762259,
 	"average_response_time":"89.294942ms",
 	"average_response_time_sec":0.089294942
 }
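
A quick way to grab this output from the command line (assuming curl is available; substitute the real host and port):

	curl http://acs:port/debug/stats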

debug/pprof/

This contains the standard Go profiler support routes. There are multiple tools that can hook into these routes and create profile dumps that can be analyzed later or in real time. This document is not an exhaustive explanation of everything they can do; it is meant as a guide.

To be of any real use, the following setting in ACS’s general.yaml must be set to something on the order of 90s, otherwise longer profiling measurement periods cannot be attempted:

Network:
	# Setting this to a low period might prematurely close connections.
	# Too high and you risk allowing too many connections to remain open.
	# 90-180s might be 'just right'.
	WriteTimeout: 90s
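
With the write timeout raised, longer CPU profile collections can be requested through the standard seconds query parameter, for example (host and port are placeholders):

	go tool pprof "http://acs:port/debug/pprof/profile?seconds=60"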

Web

As a first pass, you can simply browse to http://acs:port/debug/pprof/ (note the trailing /) and be presented with a barren, but useful, webpage.

pprof

This is the standard tool provided with Go. You should consult pprof’s package docs and the Profiling Go Programs blog post. General usage is via

	go tool pprof http://acs:port/debug/pprof/profile

or similar for the /profile, /heap, /goroutine, /block, /mutex, and /threadcreate subroutes.
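
Profiles can also be saved to disk and analyzed offline later; a minimal sketch, assuming curl is available:

	curl -o heap.pb.gz http://acs:port/debug/pprof/heap
	go tool pprof heap.pb.gz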

pprof can create some useful diagrams assuming you have graphviz:

	(brew|apt|dep|pkg) install gperftools graphviz
	pprof --web http://acs:port/debug/pprof/profile

go-torch

go-torch is a tool to generate flame graphs to aid in application profiling. You first need to install the tool via:

go get github.com/uber/go-torch #install go-torch
cd $GOPATH/src/github.com/uber/go-torch
git clone https://github.com/brendangregg/FlameGraph.git #fetch FlameGraph deps.

When that is done, you should be able to connect directly to a running ACS instance and create SVG plots via:

go-torch -u http://avaps-dev.eol.ucar.edu:8080/debug/pprof/  --file="torch.svg"
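
go-torch samples for 30 seconds by default; a longer window can be requested with its seconds flag (a sketch - check go-torch --help for the exact spelling, and remember the WriteTimeout caveat above):

	go-torch -u http://avaps-dev.eol.ucar.edu:8080/debug/pprof/ -t 60 --file="torch-60s.svg"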

Charts

ACS makes use of the github.com/mkevac/debugcharts package, so when connected to the internet (e.g., during ground testing), you should be able to access https://acs:port/debug/charts/ for charts that update every second.
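
debugcharts is normally enabled with nothing more than a blank import, which registers its handlers on the default HTTP mux from its init function. A minimal sketch of that wiring (not ACS's actual code):

	package main

	import (
		"log"
		"net/http"

		_ "net/http/pprof"                // registers the /debug/pprof/ routes
		_ "github.com/mkevac/debugcharts" // registers the /debug/charts/ routes
	)

	func main() {
		// Both debug route sets hang off http.DefaultServeMux, which is
		// what a nil handler serves.
		log.Fatal(http.ListenAndServe(":8080", nil))
	}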

Memory Leak Debugging

The best way to learn is usually by example. Over the time frame of roughly July 3rd to July 12th, I noticed that the ACS instance was gobbling up large amounts of memory just idling. Which is bad. Most of the commits in that range are various steps of locating the problem and attempting to fix it. The problem was first noticed by looking at the charts route, which has some pretty useful time series charts that showed the heap growing indefinitely.

Time to roll up the sleeves. First I cobbled together a script that automated gathering route data for analysis later - the key part is below:

# $1: pprof route, $2: output file basename, $3: flame graph title,
# $4: extra pprof flags. $URL points at the running ACS instance.
function sampleize() {
  go tool pprof -proto "$URL/debug/pprof/$1" 2> "$2.log" > "$2.pb.gz"
  go tool pprof -top $4 "$2.pb.gz" >> "$2.log"
  go-torch --title "$3" -f "$2.svg" "$2.pb.gz" &> /dev/null
}
sampleize "heap" "heap" "Heap" "-sample_index=inuse_space"
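
The same function can sample the other routes; a couple of hypothetical invocations (the fourth argument is left empty when no extra pprof flags are needed):

	sampleize "heap" "heap-alloc" "Heap Allocations" "-sample_index=alloc_space"
	sampleize "profile" "cpu" "CPU" ""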

What this does is grab the heap data from the /debug/pprof/heap route and dump it to a gzipped protocol buffer file. From there, we can use the go tool pprof tooling to analyze the snapshot and start to get a useful idea of what is going on. Ok, so now we have a snapshot; what can we do with it? Essentially, where’s the (memory leak) beef?

For our usage, I have taken three snapshots: one just after startup, and the next two about 10 minutes apart. The first thing to do is check what is happening just after startup:

  λ go tool pprof 1c095e0818f595cf7e6f065334cda1aadfbc09ec/heap.pb.gz
File: acs-daemon
Build ID: 9f52a8408ff14ff129c7b44bfd637d7f350a1139
Type: inuse_space
Time: Jul 16, 2018 at 2:05pm (MDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 48101.46kB, 100% of 48101.46kB total
Showing top 20 nodes out of 151
      flat  flat%   sum%        cum   cum%
18963.79kB 39.42% 39.42% 18963.79kB 39.42%  reflect.unsafe_New
 9040.26kB 18.79% 58.22%  9040.26kB 18.79%  github.com/NCAR/ACS/vendor/github.com/mkevac/debugcharts.init.0
 3586.06kB  7.46% 65.67%  3586.06kB  7.46%  net/textproto.(*Reader).ReadMIMEHeader
 2626.67kB  5.46% 71.13%  2626.67kB  5.46%  regexp.(*bitState).reset
 2048.69kB  4.26% 75.39%  5120.98kB 10.65%  github.com/NCAR/ACS/vendor/github.com/gorilla/sessions.(*Registry).Get
 2048.25kB  4.26% 79.65%  2048.25kB  4.26%  database/sql.cloneBytes (inline)
 1536.31kB  3.19% 82.85%  1536.31kB  3.19%  net/http.(*Request).WithContext
 1536.05kB  3.19% 86.04%  4096.31kB  8.52%  database/sql.convertAssign
 1024.05kB  2.13% 88.17%  1024.05kB  2.13%  github.com/NCAR/ACS/vendor/github.com/gorilla/sessions.NewSession (inline)
  561.50kB  1.17% 89.34%   561.50kB  1.17%  html.init
  520.04kB  1.08% 90.42%   520.04kB  1.08%  regexp.(*bitState).push
  512.88kB  1.07% 91.48%   512.88kB  1.07%  sync.(*Map).LoadOrStore
  512.38kB  1.07% 92.55%   512.38kB  1.07%  regexp.onePassCopy
  512.14kB  1.06% 93.61%   512.14kB  1.06%  github.com/NCAR/ACS/vendor/github.com/gorilla/context.Set
  512.14kB  1.06% 94.68%   512.14kB  1.06%  github.com/NCAR/ACS/vendor/github.com/jinzhu/gorm.parseTagSetting
  512.14kB  1.06% 95.74%   512.14kB  1.06%  reflect.mapassign
  512.05kB  1.06% 96.81%   512.05kB  1.06%  context.(*cancelCtx).Done
  512.05kB  1.06% 97.87%   512.05kB  1.06%  net/http.(*persistConn).roundTrip
  512.02kB  1.06% 98.94%   512.02kB  1.06%  encoding/gob.decString
  512.01kB  1.06%   100%   512.01kB  1.06%  database/sql.asString

Nothing too surprising here. Using ~48MB of RAM. reflect.unsafe_New is dominating. What is calling reflect.unsafe_New so often? We can use (pprof) web or (pprof) png to get an edge-weighted directed call graph.

(pprof) web

Neato! Well, reflect.unsafe_New belongs strictly to gorm - the database library. Ok, so that makes sense. gorm uses reflection to fetch struct tags and whatnot to store and retrieve data from the database - that is the whole point of the package, after all. Looking back over the graph and the output of top, it’s not clear at startup what is leaking memory - but that isn’t exactly surprising: leaks happen over time. Let’s give it some time. From previous experience, the leak becomes more obvious after ~20 minutes.
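
To see who is calling a hot allocation site without leaving the terminal, pprof's peek command prints the callers and callees of functions matching a regexp, for example:

(pprof) peek unsafe_New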

Take another snapshot. Rinse and repeat. Something is slightly different:

λ go tool pprof 1c095e0818f595cf7e6f065334cda1aadfbc09ec-3/heap.pb.gz
File: acs-daemon
Build ID: 9f52a8408ff14ff129c7b44bfd637d7f350a1139
Type: inuse_space
Time: Jul 16, 2018 at 2:34pm (MDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 72.50MB, 94.16% of 77MB total
Showing top 20 nodes out of 176
      flat  flat%   sum%        cum   cum%
   19.02MB 24.70% 24.70%    19.02MB 24.70%  reflect.unsafe_New
   14.01MB 18.19% 42.89%    14.01MB 18.19%  net/textproto.(*Reader).ReadMIMEHeader
    8.83MB 11.46% 54.35%     8.83MB 11.46%  github.com/NCAR/ACS/vendor/github.com/mkevac/debugcharts.init.0
    5.50MB  7.14% 61.50%     5.50MB  7.14%  net/http.(*Request).WithContext
       5MB  6.50% 67.99%    12.50MB 16.24%  github.com/NCAR/ACS/vendor/github.com/gorilla/sessions.(*Registry).Get
    3.08MB  4.00% 71.99%     3.08MB  4.00%  regexp.(*bitState).reset
       3MB  3.90% 75.89%        3MB  3.90%  github.com/NCAR/ACS/vendor/github.com/gorilla/context.Set
       2MB  2.60% 78.49%        2MB  2.60%  reflect.mapassign
       2MB  2.60% 81.09%        2MB  2.60%  database/sql.cloneBytes (inline)
    1.50MB  1.95% 83.04%     1.50MB  1.95%  github.com/NCAR/ACS/vendor/github.com/gorilla/sessions.NewSession (inline)
    1.50MB  1.95% 84.98%     1.50MB  1.95%  net.sockaddrToTCP
    1.50MB  1.95% 86.93%        4MB  5.19%  database/sql.convertAssign
    1.50MB  1.95% 88.88%     1.50MB  1.95%  encoding/gob.decString
       1MB  1.30% 90.18%     7.50MB  9.74%  github.com/NCAR/ACS/vendor/github.com/gorilla/sessions.(*CookieStore).New
    0.55MB  0.71% 90.89%     0.55MB  0.71%  html.init
    0.51MB  0.66% 91.55%     0.51MB  0.66%  regexp.(*bitState).push
    0.50MB  0.65% 92.20%     0.50MB  0.65%  github.com/NCAR/ACS/structs/speca.NewFrame
    0.50MB  0.65% 92.85%     0.50MB  0.65%  github.com/NCAR/ACS/structs/speca.Sweep.Maximum
    0.50MB  0.65% 93.51%     0.50MB  0.65%  sync.(*Map).LoadOrStore
    0.50MB  0.65% 94.16%     0.50MB  0.65%  bytes.makeSlice

We are now using 72.5MB of RAM. Ouch. It looks like ReadMIMEHeader is growing, and various gorilla packages are now prominent. We could spend time flipping back and forth between the outputs of sample 1 and sample 2, but it would be better to use the functionality the tool provides. There is a -base flag that tells pprof to use one sample as the baseline and then deal only with the diffs. How handy! It’s like the tool was built for doing this sort of thing.

λ go tool pprof -base heap.pb.gz ../1c095e0818f595cf7e6f065334cda1aadfbc09ec-3/heap.pb.gz
File: acs-daemon
Build ID: 9f52a8408ff14ff129c7b44bfd637d7f350a1139
Type: inuse_space
Time: Jul 16, 2018 at 2:05pm (MDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top 20
Showing nodes accounting for 30750.61kB, 96.78% of 31774.71kB total
Dropped 2 nodes (cum <= 158.87kB)
Showing top 20 nodes out of 103
      flat  flat%   sum%        cum   cum%
10757.38kB 33.86% 33.86% 10757.38kB 33.86%  net/textproto.(*Reader).ReadMIMEHeader
 4096.94kB 12.89% 46.75%  4096.94kB 12.89%  net/http.(*Request).WithContext
 3073.03kB  9.67% 56.42%  7681.58kB 24.18%  github.com/NCAR/ACS/vendor/github.com/gorilla/sessions.(*Registry).Get
 2560.59kB  8.06% 64.48%  2560.59kB  8.06%  github.com/NCAR/ACS/vendor/github.com/gorilla/context.Set
 1536.42kB  4.84% 69.31%  1536.42kB  4.84%  reflect.mapassign
 1536.07kB  4.83% 74.15%  1536.07kB  4.83%  net.sockaddrToTCP
 1024.05kB  3.22% 77.37%  4608.55kB 14.50%  github.com/NCAR/ACS/vendor/github.com/gorilla/sessions.(*CookieStore).New
 1024.03kB  3.22% 80.59%  1024.03kB  3.22%  encoding/gob.decString
  528.17kB  1.66% 82.26%   528.17kB  1.66%  regexp.(*bitState).reset
  514.63kB  1.62% 83.88%   514.63kB  1.62%  github.com/NCAR/ACS/structs/speca.NewFrame
  514.63kB  1.62% 85.50%   514.63kB  1.62%  github.com/NCAR/ACS/structs/speca.Sweep.Maximum
  512.56kB  1.61% 87.11%   512.56kB  1.61%  bytes.makeSlice
 -512.05kB  1.61% 85.50%  -512.05kB  1.61%  net/http.(*persistConn).roundTrip
  512.04kB  1.61% 87.11%   512.04kB  1.61%  github.com/NCAR/ACS/vendor/github.com/gorilla/sessions.NewSession (inline)
  512.03kB  1.61% 88.72%   512.03kB  1.61%  context.WithCancel
  512.02kB  1.61% 90.33%   512.02kB  1.61%  context.WithValue
  512.02kB  1.61% 91.94%  3072.61kB  9.67%  github.com/NCAR/ACS/vendor/github.com/gorilla/sessions.GetRegistry
  512.02kB  1.61% 93.55%   512.02kB  1.61%  runtime.systemstack
  512.02kB  1.61% 95.17%   512.02kB  1.61%  net/textproto.(*Reader).ReadLine
  512.01kB  1.61% 96.78%   512.01kB  1.61%  reflect.unsafe_New

From top we see that acs has scarfed up another 30MB, and a large bulk of it is taken by gorilla packages. Interesting. gorilla/sessions is only used directly by one package in acs, routes/auth, as part of handling session cookies. routes/auth was written pretty early on, before much experience with the HTTP protocol; it uses sessions strictly for shoveling User data into the request, which probably isn’t the best approach. These discoveries led to reworking the auth package to remove the need for gorilla/sessions, gorilla/securecookie, and gorilla/context, the last of which has been mostly superseded by the standard library’s context package.
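
The replacement pattern is roughly what the standard library encourages: authentication middleware stashes the user in the request's own context rather than in a session registry. A minimal sketch, where the User type, context key, and function names are hypothetical rather than ACS's actual code:

	package auth

	import (
		"context"
		"net/http"
	)

	// User stands in for whatever ACS actually tracks per session.
	type User struct{ Name string }

	// ctxKey is unexported so other packages cannot collide with our value.
	type ctxKey struct{}

	// WithUser is hypothetical middleware that attaches the authenticated
	// user to the request context instead of a gorilla session.
	func WithUser(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			u := &User{Name: "example"} // real code would decode a cookie here
			ctx := context.WithValue(r.Context(), ctxKey{}, u)
			next.ServeHTTP(w, r.WithContext(ctx))
		})
	}

	// FromRequest retrieves the user later in the handler chain.
	func FromRequest(r *http.Request) (*User, bool) {
		u, ok := r.Context().Value(ctxKey{}).(*User)
		return u, ok
	}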

Hopefully this has given enough information on how to use pprof to aid in checking for memory leaks. As a useful note, the difference between working with heap and profile (CPU usage) snapshots is strictly the name. The tooling works the same.