Adam Drake

Faster command line tools with Go

Update

After some input from u/epiris on Reddit, I improved the code a bit further by changing the way bytestreams are scanned. Current fastest runtime is 0.308s for 10,512,769 rows, or about 34 Million rows per second. Since the file is 184 Megabytes, this is a processing speed of about 600 Megabytes per second, which is probably close to the read speed limit of my SSD.


It all started with a blog post, Faster Command Line Tools in D. It was at the top of Hacker News and got picked up by other sites as well. Then Euan Torano wrote Faster Command Line Tools in Nim. After reading about both, I found some comments in the Reddit post on the same topic, but in Go. Go is a great tool for many things, but I wasn’t sure what kind of performance I’d get out of this kind of tool.

TL;DR: I wrote a special-case version of the utility which processes the input in about 0.482 seconds, or about 21.8 million lines per second on my laptop, which is a processing rate of about 380 Megabytes per second.

The approximate performance of the other languages was as follows:

  • Python: 15 seconds
  • D (DMD): 2.4 seconds
  • D (LDC): 1.4 seconds
  • Nim: 1.4 seconds

  • My current-fastest Go version: 0.308s

Problem statement

The original version of the problem, as outlined in the post on D, was the following:

It’s a common programming task: Take a data file with fields separated by a delimiter (comma, tab, etc), and run a mathematical calculation involving several of the fields. Often these programs are one-time use scripts, other times they have longer shelf life. Speed is of course appreciated when the program is used more than a few times on large files.

The specific exercise we’ll explore starts with files having keys in one field, integer values in another. … Fields are delimited by a TAB, and there may be any number of fields on a line. The file name and field numbers of the key and value are passed as command line arguments.

Okay. We’ll have a delimited file (with tabs in this case), and we want to provide the filename, index for the key column, and index for the value column to our code. With those, our code should compute the sum of all the values for each specific key, and then output the key having the largest value. In SQL terms, this is a GROUP BY and a MAX. (Note that I did not do a version of this problem in SQL, but from what I’ve seen it is slower than the Go/Nim/D versions, though perhaps not the Python version.)

The data we will use is a file of n-grams from the Google Books dataset. The file is 184 Megabytes, uncompressed, and has a total of 10,512,769 lines.
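For orientation, each row of the n-gram file has four tab-separated fields: the n-gram itself, the year, the match count, and the volume count. An illustrative (made-up) row looks like this:

```
actually_VERB	2006	9524	2309
```

The code below uses field 1 (the year) as the key and field 2 (the match count) as the value, which is why the answer turns out to be a year.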

First attempt

As a first try, we’ll use all the built-in Go libraries for string and file processing. We’ll keep the code general and high-level, and see what kind of speed we get.

Since I like to start with data structures, we’ll just use a simple map[string]int in this case, with the string keys being the keys from our file, and the integer values being the sum of all the values in the file for that particular key. In other words, it’s a for loop over the rows in the file, and an increment of a map value each time we go through the loop. Then at the end we’ll do a for loop over all the elements in the map and see which key has the largest value.

package main

import (
	"bufio"
	"flag"
	"fmt"
	"log"
	"math"
	"os"
	"strconv"
	"strings"
)

func processFile(filePath string, keyIndex, valueIndex int) {
	delim := "\t"

	fileHandle, err := os.Open(filePath)
	if err != nil {
		log.Fatal(err)
	}
	defer fileHandle.Close()

	maxFieldIndex := int(math.Max(float64(keyIndex), float64(valueIndex)))
	sumByKey := make(map[string]int)

	fileReader := bufio.NewScanner(fileHandle)

	for fileReader.Scan() {
		fields := strings.Split(strings.TrimSpace(fileReader.Text()), delim)

		if maxFieldIndex < len(fields) {
			value, err := strconv.Atoi(fields[valueIndex])
			if err != nil {
				log.Fatal(err)
			}
			sumByKey[fields[keyIndex]] += value
		}
	}

	maxValue := 0
	maxKey := ""
	for k, v := range sumByKey {
		if v > maxValue {
			maxValue = v
			maxKey = k
		}
	}
	fmt.Println("max_key:", maxKey, "sum:", sumByKey[maxKey])

}

func main() {
	filePath := flag.String("filePath", "", "Name of the file to parse")
	keyIndex := flag.Int("keyIndex", 0, "Index of key (0 is first position)")
	valueIndex := flag.Int("valueIndex", 0, "Index of value (0 is first position)")

	flag.Parse()
	processFile(*filePath, *keyIndex, *valueIndex)

}
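Once built, the tool takes its inputs as flags. Assuming the data file is saved as ngrams.tsv, with the key (the year) in column 1 and the value (the match count) in column 2, an invocation looks like:

```
go build -o v1
./v1 -filePath ngrams.tsv -keyIndex 1 -valueIndex 2
```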

The processFile() function does all the work, and that makes it easy for us to write a benchmark.

package main

import (
	"testing"
)

func Benchmark_processFile(b *testing.B) {
	for i := 0; i < b.N; i++ {
		processFile("../ngrams.tsv", 1, 2)
	}
}

Results

The first version works, but it’s not very fast. We can measure it by running go test -cpuprofile cpu.prof -memprofile mem.prof -bench ., which gives us both a benchmark and CPU/memory profiles showing where the time goes.

max_key: 2006 sum: 22569013
Benchmark_processFile-4   	       1	3486100959 ns/op
PASS
ok  	github.com/adamdrake/faster-command-line-tools-in-nim/Go/v1	3.491s

Our first version takes about 3.49 seconds to run, which isn’t bad, but isn’t great. Why so slow? Let’s go tool pprof cpu.prof and find out.

Entering interactive mode (type "help" for commands)
(pprof) text
2250ms of 3490ms total (64.47%)
Dropped 33 nodes (cum <= 17.45ms)
Showing top 10 nodes out of 65 (cum >= 160ms)
      flat  flat%   sum%        cum   cum%
     600ms 17.19% 17.19%     1530ms 43.84%  strings.genSplit
     330ms  9.46% 26.65%      790ms 22.64%  runtime.mallocgc
     260ms  7.45% 34.10%      260ms  7.45%  runtime.heapBitsSetType
     200ms  5.73% 39.83%      200ms  5.73%  runtime.indexbytebody
     200ms  5.73% 45.56%      310ms  8.88%  runtime.mapaccess1_faststr
     170ms  4.87% 50.43%      410ms 11.75%  runtime.mapassign
     160ms  4.58% 55.01%      160ms  4.58%  runtime.aeshashbody
     120ms  3.44% 58.45%      400ms 11.46%  strings.Count
     110ms  3.15% 61.60%     3440ms 98.57%  github.com/adamdrake/faster-command-line-tools-in-nim/Go/v1.processFile
     100ms  2.87% 64.47%      160ms  4.58%  strconv.ParseInt

It seems we spend a lot of time in genSplit(), the generic split function that strings.Split() calls under the hood. To see where that time goes, we can type list processFile in pprof to get the line-by-line timing for our function. Here’s a selection of the output of list processFile.

         .      240ms     29:	for fileReader.Scan() {
         .      2.19s     30:		fields := strings.Split(strings.TrimSpace(fileReader.Text()), delim)
         .          .     31:
         .          .     32:		if maxFieldIndex < len(fields) {
      10ms      190ms     33:			value, err := strconv.Atoi(fields[valueIndex])
         .          .     34:			if err != nil {
         .          .     35:				log.Fatal(err)
         .          .     36:			}
      80ms      800ms     37:			sumByKey[fields[keyIndex]] += value
         .          .     38:		}
         .          .     39:	}

We see that out of the 3.49s we’re spending processing our file, 2.19s of that is spent just splitting strings. To confirm that genSplit() is in strings.Split() we can do the same thing with list Split and get all the details on where the time is going.

Let’s assume that we aren’t going to dive in and start making changes to the standard library. What should we do to improve the speed? The natural thing for most Go programmers is to turn to channels and goroutines. Spoiler alert: that is slower.

V2: Channels and Goroutines

Channels and goroutines are great, but often overused. They have a startup cost, and many people are unaware that a channel is a data structure containing a mutex, and is therefore subject to lock contention (i.e., channels can be slow). Regardless, let’s see how the performance fares if we go that route. We’ll keep the map and other data in a struct, which we’ll protect with a sync.Mutex so that we have thread-safe write access to the map (recall: in Go, a map is NOT safe for concurrent writes).

package main

import (
	"flag"
	"fmt"
	"log"
	"math"
	"strconv"
	"strings"
	"sync"

	fstream "github.com/adamdrake/gofstream"
)

type kv struct {
	keyIndex, valueIndex int
	store                map[string]int
	sync.Mutex
}

func worker(rows chan string, data *kv, wg *sync.WaitGroup) {
	defer wg.Done()
	keyIndex := data.keyIndex
	valueIndex := data.valueIndex
	maxFieldIndex := int(math.Max(float64(keyIndex), float64(valueIndex)))
	sumByKey := make(map[string]int)
	for r := range rows {
		fields := strings.Split(strings.TrimSpace(r), "\t")

		if maxFieldIndex < len(fields) {
			value, err := strconv.Atoi(fields[data.valueIndex])
			if err != nil {
				log.Fatal(err)
			}
			sumByKey[fields[keyIndex]] += value
		}
	}
	data.Lock()
	for k, v := range sumByKey {
		data.store[k] += v

	}
	data.Unlock()
}

func processFile(filePath string, keyIndex, valueIndex int) {

	rows, err := fstream.New(filePath, 100000)
	if err != nil {
		log.Fatal(err)
	}

	data := &kv{keyIndex: keyIndex, valueIndex: valueIndex, store: make(map[string]int)}
	wg := &sync.WaitGroup{}
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go worker(rows, data, wg)
	}
	wg.Wait()

	maxValue := 0
	maxKey := ""
	for k, v := range data.store {
		if v > maxValue {
			maxValue = v
			maxKey = k
		}
	}
	fmt.Println("max_key:", maxKey, "sum:", data.store[maxKey])

}

func main() {
	filePath := flag.String("filePath", "", "Name of the file to parse")
	keyIndex := flag.Int("keyIndex", 0, "Index of key (0 is first position)")
	valueIndex := flag.Int("valueIndex", 0, "Index of value (0 is first position)")

	flag.Parse()
	processFile(*filePath, *keyIndex, *valueIndex)

}

This code has the same processFile() signature as before, but this time it creates a streaming reader for the file and streams each row over a channel. We then hand the channel to four worker goroutines, which do the same work we did before. We also have some additional complexity in the form of a sync.WaitGroup to make sure we actually wait for the workers to finish before main() returns. What’s the performance? Not great. In fact, it’s worse, at about 4.5s.

max_key: 2006 sum: 22569013
Benchmark_processFile-4   	       1	4476065569 ns/op
PASS
ok  	github.com/adamdrake/faster-command-line-tools-in-nim/Go/v4	4.482s

I won’t go into the pprof output for this one, but the reason it’s slower is all the time spent in runtime.procyield, thanks to our goroutines. I probably could have tried sync/atomic instead of a mutex on the struct, but spinning up goroutines for such a small file doesn’t make much sense in this case anyway. If the file were larger, perhaps we’d get the benefit of using all cores, but for input of this size it just isn’t worth it.
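I didn’t benchmark the sync/atomic idea, but a sketch of how it might look, assuming keys can be mapped to a small integer range up front (the same assumption the array-based version later in the post relies on; sumRows is a hypothetical helper, not code from any of the versions here):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sumRows fans (key, value) pairs out to worker goroutines that accumulate
// into a shared fixed-size array with atomic adds, avoiding a mutex
// entirely. Keys are assumed to fall in [0, 2009).
func sumRows(rows [][2]int, workers int) *[2009]int64 {
	var sums [2009]int64
	ch := make(chan [2]int, len(rows))
	for _, r := range rows {
		ch <- r
	}
	close(ch)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range ch {
				atomic.AddInt64(&sums[r[0]], int64(r[1]))
			}
		}()
	}
	wg.Wait()
	return &sums
}

func main() {
	sums := sumRows([][2]int{{2006, 5}, {2006, 7}, {1999, 3}}, 4)
	fmt.Println(sums[2006], sums[1999]) // prints: 12 3
}
```

Even so, atomic adds on shared cache lines can themselves become a bottleneck under contention, so this is no free lunch.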

V4: Stop using strings

Since a lot of the overhead is in parsing and splitting strings, why not operate on the underlying bytes instead? Credit to valyala for their version, which reads and splits on bytes instead of strings. Here’s my take on that code, with an additional twist from a Stack Overflow post which lets us use a faster string-to-integer parsing method, assuming the input values we are summing are always positive integers.

package main

import (
	"bufio"
	"bytes"
	"errors"
	"fmt"
	"os"
)

func processFile(file *os.File, keyField, valueField int) {
	var sumByKey = make(map[string]int)
	maxField := keyField
	if valueField > maxField {
		maxField = valueField
	}

	scanner := bufio.NewScanner(file)

	for scanner.Scan() {
		line := scanner.Bytes()
		key, val := getKeyVal(line, keyField, valueField, maxField)
		sumByKey[key] += val
	}
	var k string
	v := -1
	for key, val := range sumByKey {
		if val > v {
			k = key
			v = val
		}
	}
	fmt.Printf("max_key: %s sum: %d\n", k, v)

}

func main() {

	file, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer file.Close()

	keyField, _ := atoi([]byte(os.Args[2]))
	valueField, _ := atoi([]byte(os.Args[3]))
	processFile(file, keyField, valueField)

}

func getKeyVal(line []byte, keyField, valueField, maxField int) (string, int) {
	var i int
	var tabIndex int
	var k []byte
	var v []byte
	var field []byte

	for i <= maxField && len(line) > 0 {
		tabIndex = bytes.IndexByte(line, '\t') // returns -1 if not found
		if tabIndex < 0 {
			field = line
			line = nil
		} else {
			field = line[:tabIndex]
			line = line[tabIndex+1:]
		}
		switch i {
		case keyField:
			k = field
		case valueField:
			v = field
		}
		i++
	}
	val, _ := atoi(v)
	return string(k), val
}

var errAtoi = errors.New("invalid number")

func atoi(input []byte) (int, error) {
	val := 0
	for i := 0; i < len(input); i++ {
		char := input[i]
		if char < '0' || char > '9' {
			return 0, errAtoi
		}
		val = val*10 + int(char) - '0'
	}
	return val, nil
}

Now, instead of reading in the line as a string and splitting it on the delimiter with strings.Split(), we read the line as a byte slice and use bytes.IndexByte() to find each delimiter. We walk through the byte slice one field at a time, keep the fields whose positions match the key and value indexes we received on the command line, and sum the values as before. How much do we save by operating on bytes instead of strings? A lot.

max_key: 2006 sum: 22569013
Benchmark_processFile-4   	       1	1517321373 ns/op
PASS

Great! This approach cut our runtime by more than half (from about 3.49s to about 1.52s), and it’s still pretty usable as production code. It’s not too customized, it still fits the original problem definition of taking the key column and value column as command line arguments, and it’s pretty readable. Where is the time mostly spent?

(pprof) text
1200ms of 1500ms total (80.00%)
Showing top 10 nodes out of 41 (cum >= 50ms)
      flat  flat%   sum%        cum   cum%
     260ms 17.33% 17.33%      670ms 44.67%  github.com/adamdrake/faster-command-line-tools-in-nim/Go/v5.getKeyVal
     200ms 13.33% 30.67%      330ms 22.00%  runtime.mapaccess1_faststr
     170ms 11.33% 42.00%      170ms 11.33%  runtime.indexbytebody
     120ms  8.00% 50.00%      260ms 17.33%  runtime.mapassign
     100ms  6.67% 56.67%      100ms  6.67%  runtime.aeshashbody
      90ms  6.00% 62.67%      120ms  8.00%  runtime.mallocgc
      70ms  4.67% 67.33%     1500ms   100%  github.com/adamdrake/faster-command-line-tools-in-nim/Go/v5.processFile
      70ms  4.67% 72.00%       70ms  4.67%  runtime.memeqbody
      70ms  4.67% 76.67%       80ms  5.33%  syscall.Syscall
      50ms  3.33% 80.00%       50ms  3.33%  github.com/adamdrake/faster-command-line-tools-in-nim/Go/v5.atoi

Most of the time is spent splitting the bytes and getting the key/value pairs, as expected. Let’s list getKeyVal to take a closer look.

260ms      670ms (flat, cum) 44.67% of Total
         .          .     51:	var tabIndex int
         .          .     52:	var k []byte
         .          .     53:	var v []byte
         .          .     54:	var field []byte
         .          .     55:
      20ms       20ms     56:	for i <= maxField && len(line) > 0 {
      20ms      160ms     57:		tabIndex = bytes.IndexByte(line, '\t') // returns -1 if not found
      20ms       20ms     58:		if tabIndex < 0 {
         .          .     59:			field = line
         .          .     60:			line = nil
         .          .     61:		} else {
      10ms       10ms     62:			field = line[:tabIndex]
      60ms       60ms     63:			line = line[tabIndex+1:]
         .          .     64:		}
         .          .     65:		switch i {
      10ms       10ms     66:		case keyField:
         .          .     67:			k = field
      10ms       10ms     68:		case valueField:
         .          .     69:			v = field
         .          .     70:		}
      10ms       10ms     71:		i++
         .          .     72:	}
      40ms       90ms     73:	val, _ := atoi(v)
      60ms      280ms     74:	return string(k), val
         .          .     75:}

We can see that most of the time is spent finding the delimiter indexes in the bytes, with some spent on the string cast for the key (so that we can use it as the map key). In a practical scenario we’d probably stop here since, as mentioned above, this is about as fast as the code will get while still being sufficiently general and readable. We can do better, though, if we break the requirements of the original comparison.

V5: Let’s go nuts

Let’s put some of the practical concerns aside for the moment and focus on what we know about the problem. First, we are trusting our input data, which we’d normally be more careful about in a real-world situation. With that in mind, we can change the definition of our atoi() function so that it does no error checking, operates only on []byte, and returns an int. Here’s what the simplified atoi looks like.

func atoi(s []byte) int {
	i := 0
	x := 0
	for ; i < len(s); i++ {
		c := s[i]
		x = x*10 + int(c) - '0'
	}
	return x
}

We also know that the keys and values are always integers, so we could use an int type instead of string in the map keys, and we could also remove the string(k) cast in the getKeyVal() implementation above. However, we know in this case that we want the value between the first and second separator, and between the second and third separator, every time. So in this contrived instance, we don’t actually need to iterate and check delimiters. We can just treat the entire line like the bytestring it is, and only find the indexes for the first, second, and third instances of the delimiter.

Another important fact is that, for this dataset, we know the largest key. Therefore, instead of the map, which we previously saw cost us about 800ms to increment, we can use an array: our integer key becomes the index, and the value at that index is the running sum for that key. With that in mind, we have a highly customized, but SUPER FAST, version below.

func main() {

	file, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()
	processFile(file)
}

func processFile(file *os.File) {
	var sumByKey [2009]int

	scanner := bufio.NewScanner(file)
	var key int
	var val int

	for scanner.Scan() {
		line := scanner.Bytes()
		firstTab := bytes.IndexByte(line, '\t')
		secondTab := bytes.IndexByte(line[firstTab+1:], '\t') + firstTab + 1
		thirdTab := bytes.IndexByte(line[secondTab+1:], '\t') + secondTab + 1
		key = atoi(line[firstTab+1 : secondTab])
		val = atoi(line[secondTab+1 : thirdTab])
		sumByKey[key] += val
	}
	var k int
	var v int
	for i, val := range sumByKey {
		if val > v {
			k = i
			v = val
		}
	}
	fmt.Printf("max_key: %d sum: %d\n", k, v)
}

What’s the benchmark result?

Benchmark_processFile-4   	max_key: 2006 sum: 22569013
max_key: 2006 sum: 22569013
max_key: 2006 sum: 22569013
max_key: 2006 sum: 22569013
max_key: 2006 sum: 22569013
       3	 482369483 ns/op

It’s crazy fast! This version finishes in 0.482s (482ms) on my machine. Where is the most time spent now? (Note that because the benchmark ran multiple times in order to get a stable timing for the run, the pprof results are the sum of all the runs.)

      20ms      1.26s     38:	for scanner.Scan() {
      70ms       70ms     39:		line := scanner.Bytes()
      40ms      200ms     40:		firstTab := bytes.IndexByte(line, '\t')
     190ms      330ms     41:		secondTab := bytes.IndexByte(line[firstTab+1:], '\t') + firstTab + 1
     260ms      610ms     42:		thirdTab := bytes.IndexByte(line[secondTab+1:], '\t') + secondTab + 1
      80ms      160ms     43:		key = atoi(line[firstTab+1 : secondTab])
      40ms      180ms     44:		val = atoi(line[secondTab+1 : thirdTab])
      70ms       70ms     45:		sumByKey[key] += val
         .          .     46:	}

Most of the time is spent indexing into the bytestring, which is what we expected. There may be some clever ways to reduce this, in addition to some data structures which would allow us to do the incrementing faster, but at this point we’re pretty far down the path of optimizing past production usage, and hand-tuned Fortran/assembly isn’t really the goal. In the end though, we processed a 184MB file, with 10,512,769 rows, in 0.482s.

Happy optimizing!

Updated code:

Since the article was posted on Reddit and Hacker News, u/epiris on Reddit suggested some improvements. Below is the code and benchmark for the updated version incorporating their suggestions. It’s important to note that a personal requirement I have for the code is that it cannot read all of the data into memory, since we must assume the data stream could be arbitrarily large.

// processLine assumes well-formed input: each line contains at least three
// tabs, with the key in the second field and the value in the third.
func processLine(b []byte) (int, int) {
	key := 0
	val := 0
	i := 0
	for b[i] != '\t' {
		i++
	}
	for i++; b[i] != '\t'; i++ {
		key = key*10 + int(b[i]) - '0'
	}
	for i++; b[i] != '\t'; i++ {
		val = val*10 + int(b[i]) - '0'
	}
	return key, val
}

func processFile(file *os.File) (int, int) {
	var sumByKey [2009]int

	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Bytes()
		k1, v1 := processLine(line)
		sumByKey[k1] += v1
	}
	var k int
	var v int
	for i, val := range sumByKey {
		if val > v {
			k = i
			v = val
		}
	}
	return k, v

}

And here is the associated benchmark.

func Benchmark_processFile(b *testing.B) {
	file, err := os.Open("../ngrams.tsv")
	if err != nil {
		b.Fatal(err)
	}
	defer file.Close()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Rewind so each iteration scans the whole file from the start.
		if _, err := file.Seek(0, 0); err != nil {
			b.Fatal(err)
		}
		k, v := processFile(file)
		if k != 2006 || v != 22569013 {
			b.Fatalf(`bad result %v | %v`, k, v)
		}
	}
}

The output of the benchmarking:

Benchmark_processFile-4   	       5	 309631168 ns/op	    4152 B/op	       4 allocs/op