Faster command line tools with Go
Update
After some input from u/epiris on Reddit, I improved the code a bit further by changing the way bytestreams are scanned. Current fastest runtime is 0.308s for 10,512,769 rows, or about 34 Million rows per second. Since the file is 184 Megabytes, this is a processing speed of about 600 Megabytes per second, which is probably close to the read speed limit of my SSD.
It all started with a blog post, Faster Command Line Tools in D. It was at the top of Hacker News and got picked up by other sites as well. Then Euan Torano wrote Faster Command Line Tools in Nim. After reading about both, I found some comments in the Reddit post on the same topic, but in Go. Go is a great tool for many things, but I wasn’t sure what kind of performance I’d get out of this kind of tool.
TL;DR: I wrote a special-case version of the utility which processes the input in about 0.482 seconds, or about 21.8 million lines per second on my laptop, which is a processing rate of about 380 Megabytes per second.
The approximate performance of the various implementations was as follows:
Python: 15 seconds
D (DMD): 2.4 seconds
D (LDC): 1.4 seconds
Nim: 1.4 seconds
My current-fastest Go version: 0.308s (see the update above)
Problem statement
The original version of the problem, as outlined in the post on D, was the following:
It’s a common programming task: Take a data file with fields separated by a delimiter (comma, tab, etc), and run a mathematical calculation involving several of the fields. Often these programs are one-time use scripts, other times they have longer shelf life. Speed is of course appreciated when the program is used more than a few times on large files.
The specific exercise we’ll explore starts with files having keys in one field, integer values in another. … Fields are delimited by a TAB, and there may be any number of fields on a line. The file name and field numbers of the key and value are passed as command line arguments.
Okay. We’ll have a delimited file (with tabs in this case), and we want to provide the filename, the index of the key column, and the index of the value column to our code. With those, our code should compute the sum of all the values for each key, and then output the key having the largest sum. In SQL terms, this is a GROUP BY and a MAX. (Note that I did not do a version of this problem in SQL, but from what I’ve seen it is slower than the Go/Nim/D versions, though perhaps not the Python version.)
The data we will use is a file of n-grams from the Google Books dataset. The file is 184 Megabytes, uncompressed, and has a total of 10,512,769 lines.
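To make the task concrete, here is a tiny made-up input (the real file contains n-gram counts; these rows are purely illustrative), with the key in column 1 and the value in column 2, zero-indexed:

foo	2005	10	3
bar	2006	7	2
baz	2005	4	1
qux	2006	9	2

Grouping by column 1 and summing column 2 gives 14 for key 2005 and 16 for key 2006, so the program should report key 2006 with sum 16.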
First attempt
As a first try, we’ll use all the built-in Go libraries for string and file processing. We’ll keep the code general and high-level, and see what kind of speed we get.
Since I like to start with data structures, we’ll just use a simple map[string]int in this case, with the string keys being the keys from our file, and the integer values being the sum of all the values in the file for that particular key. In other words, it’s a for loop over the rows in the file, incrementing a map value each time through the loop. Then at the end we’ll loop over all the elements in the map and see which key has the largest value.
package main

import (
    "bufio"
    "flag"
    "fmt"
    "log"
    "math"
    "os"
    "strconv"
    "strings"
)

func processFile(filePath string, keyIndex, valueIndex int) {
    delim := "\t"
    fileHandle, err := os.Open(filePath)
    if err != nil {
        log.Fatal(err)
    }
    defer fileHandle.Close()

    maxFieldIndex := int(math.Max(float64(keyIndex), float64(valueIndex)))
    sumByKey := make(map[string]int)

    fileReader := bufio.NewScanner(fileHandle)
    for fileReader.Scan() {
        fields := strings.Split(strings.TrimSpace(fileReader.Text()), delim)
        if maxFieldIndex < len(fields) {
            value, err := strconv.Atoi(fields[valueIndex])
            if err != nil {
                log.Fatal(err)
            }
            sumByKey[fields[keyIndex]] += value
        }
    }

    maxValue := 0
    maxKey := ""
    for k, v := range sumByKey {
        if v > maxValue {
            maxValue = v
            maxKey = k
        }
    }
    fmt.Println("max_key:", maxKey, "sum:", sumByKey[maxKey])
}

func main() {
    filePath := flag.String("filePath", "", "Name of the file to parse")
    keyIndex := flag.Int("keyIndex", 0, "Index of key (0 is first position)")
    valueIndex := flag.Int("valueIndex", 0, "Index of value (0 is first position)")
    flag.Parse()
    processFile(*filePath, *keyIndex, *valueIndex)
}
The processFile()
function does all the work, and that makes it easy for us to write a benchmark.
package main

import (
    "testing"
)

func Benchmark_processFile(b *testing.B) {
    for i := 0; i < b.N; i++ {
        processFile("../ngrams.tsv", 1, 2)
    }
}
Results
The first version works, but it’s not very fast. We can see its performance by running go test -cpuprofile cpu.prof -memprofile mem.prof -bench . both to get a benchmark and to see which parts are slow.
max_key: 2006 sum: 22569013
Benchmark_processFile-4 1 3486100959 ns/op
PASS
ok github.com/adamdrake/faster-command-line-tools-in-nim/Go/v1 3.491s
Our first version takes about 3.49 seconds to run, which isn’t bad, but isn’t great. Why so slow? Let’s go tool pprof cpu.prof and find out.
Entering interactive mode (type "help" for commands)
(pprof) text
2250ms of 3490ms total (64.47%)
Dropped 33 nodes (cum <= 17.45ms)
Showing top 10 nodes out of 65 (cum >= 160ms)
flat flat% sum% cum cum%
600ms 17.19% 17.19% 1530ms 43.84% strings.genSplit
330ms 9.46% 26.65% 790ms 22.64% runtime.mallocgc
260ms 7.45% 34.10% 260ms 7.45% runtime.heapBitsSetType
200ms 5.73% 39.83% 200ms 5.73% runtime.indexbytebody
200ms 5.73% 45.56% 310ms 8.88% runtime.mapaccess1_faststr
170ms 4.87% 50.43% 410ms 11.75% runtime.mapassign
160ms 4.58% 55.01% 160ms 4.58% runtime.aeshashbody
120ms 3.44% 58.45% 400ms 11.46% strings.Count
110ms 3.15% 61.60% 3440ms 98.57% github.com/adamdrake/faster-command-line-tools-in-nim/Go/v1.processFile
100ms 2.87% 64.47% 160ms 4.58% strconv.ParseInt
It seems like we spend a lot of time in genSplit(), which is the general split function invoked by strings.Split() in our code. To see this, we can type list processFile in pprof to get the line-by-line timing for that function. Here’s a selection of the output of list processFile.
. 240ms 29: for fileReader.Scan() {
. 2.19s 30: fields := strings.Split(strings.TrimSpace(fileReader.Text()), delim)
. . 31:
. . 32: if maxFieldIndex < len(fields) {
10ms 190ms 33: value, err := strconv.Atoi(fields[valueIndex])
. . 34: if err != nil {
. . 35: log.Fatal(err)
. . 36: }
80ms 800ms 37: sumByKey[fields[keyIndex]] += value
. . 38: }
. . 39: }
We see that out of the 3.49s we spend processing our file, 2.19s goes to just splitting strings. To confirm that genSplit() is what strings.Split() calls, we can do the same thing with list Split and get all the details on where the time is going.
Let’s assume that we aren’t going to dive in and start making changes to the standard library. What should we do to improve the speed? The natural thing for most Go programmers is to turn to channels and goroutines. Spoiler alert: that is slower.
V2: Channels and Goroutines
Channels and goroutines are great, but often overused. They have a startup cost, and many people are not aware that a channel is a data structure containing a mutex, and is therefore subject to lock contention (i.e., it can be slow). Regardless, let’s see how the performance fares if we go that route. We’ll keep the map and other data in a struct, which we’ll protect with a sync.Mutex so that we have thread-safe write access to the map (recall: in Go, a map is NOT thread-safe).
package main

import (
    "flag"
    "fmt"
    "log"
    "math"
    "strconv"
    "strings"
    "sync"

    fstream "github.com/adamdrake/gofstream"
)

type kv struct {
    keyIndex, valueIndex int
    store                map[string]int
    sync.Mutex
}

func worker(rows chan string, data *kv, wg *sync.WaitGroup) {
    defer wg.Done()
    keyIndex := data.keyIndex
    valueIndex := data.valueIndex
    maxFieldIndex := int(math.Max(float64(keyIndex), float64(valueIndex)))
    sumByKey := make(map[string]int)
    for r := range rows {
        fields := strings.Split(strings.TrimSpace(r), "\t")
        if maxFieldIndex < len(fields) {
            value, err := strconv.Atoi(fields[data.valueIndex])
            if err != nil {
                log.Fatal(err)
            }
            sumByKey[fields[keyIndex]] += value
        }
    }
    data.Lock()
    for k, v := range sumByKey {
        data.store[k] += v
    }
    data.Unlock()
}

func processFile(filePath string, keyIndex, valueIndex int) {
    rows, err := fstream.New(filePath, 100000)
    if err != nil {
        log.Fatal(err)
    }
    data := &kv{keyIndex: keyIndex, valueIndex: valueIndex, store: make(map[string]int)}
    wg := &sync.WaitGroup{}
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go worker(rows, data, wg)
    }
    wg.Wait()
    maxValue := 0
    maxKey := ""
    for k, v := range data.store {
        if v > maxValue {
            maxValue = v
            maxKey = k
        }
    }
    fmt.Println("max_key:", maxKey, "sum:", data.store[maxKey])
}

func main() {
    filePath := flag.String("filePath", "", "Name of the file to parse")
    keyIndex := flag.Int("keyIndex", 0, "Index of key (0 is first position)")
    valueIndex := flag.Int("valueIndex", 0, "Index of value (0 is first position)")
    flag.Parse()
    processFile(*filePath, *keyIndex, *valueIndex)
}
This code has the same processFile() signature as before, but this time it creates a streaming reader for the file and streams each row over a channel. We then hand the channel to worker goroutines, which do the same thing we did before. We also have some additional complexity in the form of a sync.WaitGroup to make sure we actually wait for the workers to finish before main() returns. What’s the performance? Not great. In fact, it’s worse, at about 4.5s.
max_key: 2006 sum: 22569013
Benchmark_processFile-4 1 4476065569 ns/op
PASS
ok github.com/adamdrake/faster-command-line-tools-in-nim/Go/v4 4.482s
I won’t go into the pprof output for this one, but you can see that the reason it’s slower is runtime.procyield, due to all our goroutines. I probably could have tried some things with sync/atomic instead of using a mutex on the struct, but spinning up goroutines for such a small file doesn’t make much sense in this case anyway. If the file were larger, perhaps we’d get the benefit of using all cores, but for input of this size it just isn’t worth it.
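For completeness, here is a rough sketch of what the sync/atomic route might look like. This is not from the benchmarks above: it assumes the keys are small non-negative integers (they are years in this dataset), so the mutex-protected map can be replaced by a fixed array of counters that every worker updates atomically. It would slot into the V2 program in place of worker(), with an extra import of sync/atomic.

// Rough sketch, not benchmarked: shared fixed-size counters updated with
// sync/atomic instead of a mutex-protected map. Assumes keys are small
// non-negative integers.
var sums [2009]int64

func atomicWorker(rows chan string, keyIndex, valueIndex int, wg *sync.WaitGroup) {
    defer wg.Done()
    for r := range rows {
        fields := strings.Split(strings.TrimSpace(r), "\t")
        if keyIndex >= len(fields) || valueIndex >= len(fields) {
            continue
        }
        key, err := strconv.Atoi(fields[keyIndex])
        if err != nil || key < 0 || key >= len(sums) {
            continue
        }
        val, err := strconv.Atoi(fields[valueIndex])
        if err != nil {
            continue
        }
        // Concurrent-safe increment without a mutex.
        atomic.AddInt64(&sums[key], int64(val))
    }
}

Even then, the profile points at goroutine scheduling and string splitting rather than the map updates, so this route still wouldn’t pay off for a file of this size.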
V4: Stop using strings
Since it seems that a lot of the overhead is in parsing and splitting strings, why not just operate on the underlying bytes instead? Credit to valyala for their version, which reads and splits on bytes instead of string. Here’s my version of their code, with an additional twist from a Stack Overflow post which allows us to use a faster string to integer parsing method, assuming that the input values we are summing are always positive integers.
package main

import (
    "bufio"
    "bytes"
    "errors"
    "fmt"
    "os"
)

func processFile(file *os.File, keyField, valueField int) {
    var sumByKey = make(map[string]int)
    maxField := keyField
    if valueField > maxField {
        maxField = valueField
    }
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        line := scanner.Bytes()
        key, val := getKeyVal(line, keyField, valueField, maxField)
        sumByKey[key] += val
    }
    var k string
    v := -1
    for key, val := range sumByKey {
        if val > v {
            k = key
            v = val
        }
    }
    fmt.Printf("max_key: %s sum: %d\n", k, v)
}

func main() {
    file, _ := os.Open(os.Args[1]) // error handling omitted for brevity
    defer file.Close()
    keyField, _ := atoi([]byte(os.Args[2]))
    valueField, _ := atoi([]byte(os.Args[3]))
    processFile(file, keyField, valueField)
}

func getKeyVal(line []byte, keyField, valueField, maxField int) (string, int) {
    var i int
    var tabIndex int
    var k []byte
    var v []byte
    var field []byte
    for i <= maxField && len(line) > 0 {
        tabIndex = bytes.IndexByte(line, '\t') // returns -1 if not found
        if tabIndex < 0 {
            field = line
            line = nil
        } else {
            field = line[:tabIndex]
            line = line[tabIndex+1:]
        }
        switch i {
        case keyField:
            k = field
        case valueField:
            v = field
        }
        i++
    }
    val, _ := atoi(v)
    return string(k), val
}

var errAtoi = errors.New("invalid number")

func atoi(input []byte) (int, error) {
    val := 0
    for i := 0; i < len(input); i++ {
        char := input[i]
        if char < '0' || char > '9' {
            return 0, errAtoi
        }
        val = val*10 + int(char) - '0'
    }
    return val, nil
}
Now, instead of reading in the line as a string and then splitting the string on the delimiter with strings.Split(), we’re reading in the line as a bytestring and using the bytes.IndexByte() function to find the index of the delimiter in the bytestring. We walk through the line one delimiter at a time, pick out the fields at the key and value indexes we received on the command line, and sum the values as before. How much do we save by operating on the bytes instead of the strings? A lot.
max_key: 2006 sum: 22569013
Benchmark_processFile-4 1 1517321373 ns/op
PASS
Great! This approach cut our runtime by more than half (from about 3.49s to about 1.52s), and it’s still pretty usable as production code. It’s not too customized, it still fits the requirements of the original problem definition of key column and value column being command line arguments, and it’s pretty readable. Where is the time mostly spent?
(pprof) text
1200ms of 1500ms total (80.00%)
Showing top 10 nodes out of 41 (cum >= 50ms)
flat flat% sum% cum cum%
260ms 17.33% 17.33% 670ms 44.67% github.com/adamdrake/faster-command-line-tools-in-nim/Go/v5.getKeyVal
200ms 13.33% 30.67% 330ms 22.00% runtime.mapaccess1_faststr
170ms 11.33% 42.00% 170ms 11.33% runtime.indexbytebody
120ms 8.00% 50.00% 260ms 17.33% runtime.mapassign
100ms 6.67% 56.67% 100ms 6.67% runtime.aeshashbody
90ms 6.00% 62.67% 120ms 8.00% runtime.mallocgc
70ms 4.67% 67.33% 1500ms 100% github.com/adamdrake/faster-command-line-tools-in-nim/Go/v5.processFile
70ms 4.67% 72.00% 70ms 4.67% runtime.memeqbody
70ms 4.67% 76.67% 80ms 5.33% syscall.Syscall
50ms 3.33% 80.00% 50ms 3.33% github.com/adamdrake/faster-command-line-tools-in-nim/Go/v5.atoi
Most of the time is spent splitting the bytes and getting the key/value pairs, as expected. Let’s list getKeyVal to take a closer look.
260ms 670ms (flat, cum) 44.67% of Total
. . 51: var tabIndex int
. . 52: var k []byte
. . 53: var v []byte
. . 54: var field []byte
. . 55:
20ms 20ms 56: for i <= maxField && len(line) > 0 {
20ms 160ms 57: tabIndex = bytes.IndexByte(line, '\t') // returns -1 if not found
20ms 20ms 58: if tabIndex < 0 {
. . 59: field = line
. . 60: line = nil
. . 61: } else {
10ms 10ms 62: field = line[:tabIndex]
60ms 60ms 63: line = line[tabIndex+1:]
. . 64: }
. . 65: switch i {
10ms 10ms 66: case keyField:
. . 67: k = field
10ms 10ms 68: case valueField:
. . 69: v = field
. . 70: }
10ms 10ms 71: i++
. . 72: }
40ms 90ms 73: val, _ := atoi(v)
60ms 280ms 74: return string(k), val
. . 75:}
We can see that the most time is spent finding the delimiter indexes in the bytes, plus some on the string conversion for the key (the copy is necessary because scanner.Bytes() reuses its buffer on the next Scan, so we can’t keep the raw byte slice as a map key). In a practical scenario, we’d probably stop here since, as mentioned above, this is probably about as fast as the code will get while still being sufficiently general, readable, etc. We can do better, though, if we break the requirements of the original comparison.
V5: Let’s go nuts
Let’s put some of the practical concerns aside for the moment and focus on what we know about the problem. First, we are trusting our input data, which we’d normally be more careful about in a real-world situation. With that in mind, we can change the definition of our atoi() function so that it does no error checking, operates only on []byte, and returns an int. Here’s what the simplified atoi looks like.
func atoi(s []byte) int {
    i := 0
    x := 0
    for ; i < len(s); i++ {
        c := s[i]
        x = x*10 + int(c) - '0'
    }
    return x
}
We also know that the keys and values are always integers, so we could use int instead of string for the map keys and remove the string(k) cast from the getKeyVal() implementation above. More importantly, in this case we always want the value between the first and second separator as the key, and the value between the second and third separator as the value, so we don’t actually need to iterate over the fields and check delimiters at all. We can treat the entire line as the bytestring it is and only find the indexes of the first, second, and third instances of the delimiter.
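As a rough, unbenchmarked sketch (not a version from the original comparison), that intermediate idea of keeping the map but switching to integer keys and positional delimiter lookups might look like this, reusing the simplified atoi above:

// Sketch only: keep a map, but use integer keys and positional delimiter
// lookups, avoiding the []byte -> string copy on every row.
func processFileIntKeys(file *os.File) {
    sumByKey := make(map[int]int)
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        line := scanner.Bytes()
        // The key sits between the first and second tab, the value between
        // the second and third.
        firstTab := bytes.IndexByte(line, '\t')
        secondTab := bytes.IndexByte(line[firstTab+1:], '\t') + firstTab + 1
        thirdTab := bytes.IndexByte(line[secondTab+1:], '\t') + secondTab + 1
        key := atoi(line[firstTab+1 : secondTab])
        val := atoi(line[secondTab+1 : thirdTab])
        sumByKey[key] += val
    }
    maxKey, maxSum := 0, 0
    for k, v := range sumByKey {
        if v > maxSum {
            maxKey, maxSum = k, v
        }
    }
    fmt.Printf("max_key: %d sum: %d\n", maxKey, maxSum)
}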
Another important fact is that, for this dataset, we know the largest key. Therefore, instead of the map, which we saw earlier was costing us about 800ms to increment, we can use an array. Our integer key becomes the index, and the value at that index is the running sum for that key. With that in mind, we have a highly customised, but SUPER FAST, version below.
func main() {
    file, err := os.Open(os.Args[1])
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()
    processFile(file)
}

func processFile(file *os.File) {
    var sumByKey [2009]int
    scanner := bufio.NewScanner(file)
    var key int
    var val int
    for scanner.Scan() {
        line := scanner.Bytes()
        firstTab := bytes.IndexByte(line, '\t')
        secondTab := bytes.IndexByte(line[firstTab+1:], '\t') + firstTab + 1
        thirdTab := bytes.IndexByte(line[secondTab+1:], '\t') + secondTab + 1
        key = atoi(line[firstTab+1 : secondTab])
        val = atoi(line[secondTab+1 : thirdTab])
        sumByKey[key] += val
    }
    var k int
    var v int
    for i, val := range sumByKey {
        if val > v {
            k = i
            v = val
        }
    }
    fmt.Printf("max_key: %d sum: %d\n", k, v)
}
What’s the result?
Benchmark_processFile-4 max_key: 2006 sum: 22569013
max_key: 2006 sum: 22569013
max_key: 2006 sum: 22569013
max_key: 2006 sum: 22569013
max_key: 2006 sum: 22569013
3 482369483 ns/op
It’s crazy fast! This version finishes in 0.482s (482ms) on my machine. Where is the most time spent now? (Note that because the benchmark ran multiple times in order to get a stable timing for the run, the pprof results below are the sum of all the runs.)
20ms 1.26s 38: for scanner.Scan() {
70ms 70ms 39: line := scanner.Bytes()
40ms 200ms 40: firstTab := bytes.IndexByte(line, '\t')
190ms 330ms 41: secondTab := bytes.IndexByte(line[firstTab+1:], '\t') + firstTab + 1
260ms 610ms 42: thirdTab := bytes.IndexByte(line[secondTab+1:], '\t') + secondTab + 1
80ms 160ms 43: key = atoi(line[firstTab+1 : secondTab])
40ms 180ms 44: val = atoi(line[secondTab+1 : thirdTab])
70ms 70ms 45: sumByKey[key] += val
. . 46: }
Most of the time is spent indexing into the bytestring, which is what we expected. There may be some clever ways to reduce this, along with data structures that would let us do the incrementing faster, but at this point we’re well past the point of optimizing for production use, and hand-tuned Fortran/assembly isn’t really the goal. In the end, though, we processed a 184MB file, with 10,512,769 rows, in 0.482s.
Happy optimizing!
Updated code:
Since the article was posted on Reddit and Hacker News, a user on Reddit (u/epiris) suggested some improvements. Below is the code and benchmark for the updated version, incorporating their suggestions. It’s important to note that a personal requirement I have for the code is that it cannot read all of the data into memory, since we must assume that the data stream could be arbitrarily large.
func processLine(b []byte) (int, int) {
    key := 0
    val := 0
    i := 0
    // Skip the first field.
    for b[i] != '\t' {
        i++
    }
    // Parse the key (second field).
    for i++; b[i] != '\t'; i++ {
        key = key*10 + int(b[i]) - '0'
    }
    // Parse the value (third field).
    for i++; b[i] != '\t'; i++ {
        val = val*10 + int(b[i]) - '0'
    }
    return key, val
}
func processFile(r io.Reader) (int, int) {
    var sumByKey [2009]int
    scanner := bufio.NewScanner(r)
    for scanner.Scan() {
        line := scanner.Bytes()
        k1, v1 := processLine(line)
        sumByKey[k1] += v1
    }
    var k int
    var v int
    for i, val := range sumByKey {
        if val > v {
            k = i
            v = val
        }
    }
    return k, v
}
And here is the associated benchmark.
func Benchmark_processFile(b *testing.B) {
    file, err := os.Open("../ngrams.tsv")
    if err != nil {
        b.Fatal(err)
    }
    defer file.Close()
    // Read the file into memory once, outside the timed loop.
    data, err := ioutil.ReadAll(file)
    if err != nil {
        b.Fatal(err)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        k, v := processFile(bytes.NewReader(data))
        if k != 2006 || v != 22569013 {
            b.Fatalf(`bad result %v | %v`, k, v)
        }
    }
}
The output of the benchmarking:
Benchmark_processFile-4 5 309631168 ns/op 4152 B/op 4 allocs/op
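One caveat on the arbitrarily-large-stream requirement: bufio.Scanner enforces a maximum token size of 64KB by default and will stop with bufio.ErrTooLong if it encounters a longer line. That’s no problem for this n-gram file, but if the input could contain very long lines, the scanner’s buffer limit can be raised, for example:

scanner := bufio.NewScanner(r)
// Start with a 64KB buffer but allow lines of up to 1MB.
scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)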