Wednesday
Sep222021

Why I walked away from millions of dollars to found a startup

Back in college I was deeply inspired by Paul Graham's essays, like How to Make Wealth and How To Do What You Love. His essays helped me realize I wanted to be involved in startups after I graduated. I loved the notions of avoiding the bureaucracy of big companies and being part of a small team doing something big.

Fast-forward a few years to 2013 and I was in a predicament about where to go with my career. I was working at Twitter, leading a core infrastructure team I had started. I arrived at Twitter 1.5 years earlier via the acquisition of BackType, of which I was the first employee. I was on track towards a distinguished engineer type of career, where I would be influential in the industry through conference talks and writing books and articles. I would work comfortable 9 to 5 hours and make many millions of dollars through salary and equity.

There's little I find as inspirational as the United States space program in the 1960s. I've read so much about that period, and I've listened to Kennedy's speech at Rice University dozens of times. "We choose to go to the Moon in this decade and do the other things, not because they are easy, but because they are hard." The audacity of that speech gives me chills, and it's stunning Neil Armstrong stepped foot on the Moon only seven years later.

The people of Mercury, Gemini, and Apollo pushed humanity forward. They fundamentally expanded human potential. Something deep inside me wanted to be part of something like that.

While at Twitter, I started to get bothered by a simple observation. You can describe what the consumer Twitter product does (tweets, follows, search, trends, API, etc.) in a few hours, yet it took Twitter hundreds of man-years of effort to get that product to the point of scalability and stability (remember the "fail whale"?). Software is entirely about abstraction and automation, so how can it take hundreds of man-years to build something you can describe in hours?

Bridging this gap would revolutionize the basic economics of the software industry. What does the world look like when an individual or small team can build what currently requires a large company? What does the world look like when the economics of software development are radically improved? How much innovation would be unleashed? What new capabilities would humanity gain by enabling problems to be addressed that are currently too hard or too expensive?

Of course, I didn't know at the time how to bridge that gap. I had no concrete concept of what software development looks like in that universe. All I had were a few faint ideas of directions to explore. These faint ideas came from thinking about large-scale system design as I was working on my book.

So back in 2013, I had to decide whether to take that leap into exploring something which I wasn't sure was even possible. Should I walk away from millions of dollars and a high-status position to pursue a faint notion most people would say was crazy? It's not like I wouldn't be having impact in my current career track – just that my impact would be incremental and limited in comparison.

Years earlier I saw an interview with Jeff Bezos where he talked about why he left his highly paid job to start Amazon. What tipped him over the edge was the notion of regret. He would never regret shooting for the stars with his idea for an online bookstore, but he would regret for the rest of his life not trying. His "regret minimization framework" always stuck with me.

It was clear pursuing my idea would take years and require 100% of my focus, so working on it only on nights/weekends was out of the question. Another wrench in my situation at the time was the success of my open-source project Storm. I had many users eager to give me money to provide support/consulting services, and I was getting a lot of inbound interest from respected investors. Founding a company around a successful open-source project was a proven model that would have been an exciting endeavor.

Ultimately though, Storm was at a point where the major innovation was pretty much done and further improvements would be incremental. Forming a company around it would mostly be about monetization, and it would monopolize my focus for many, many years.

My decision to pursue my crazy idea hinged on a few factors. First, I already had enough equity vested from the BackType acquisition to sustain myself for however many years it would take to pursue my idea to either viability or failure. Second, I would regret it forever if I didn't try. Third, and most importantly, I recognized the opportunity to expand human potential and revolutionize an industry is incredibly rare. That such an opportunity was even available, as faint and tenuous as it was, made it precious beyond words.

So I left Twitter to start working on Red Planet Labs. I started by exploring that faint direction I had come upon before. More than anything else, the problem was one of abstraction. How do you express applications as diverse as Twitter, BackType, Reddit, Bank of America, Gmail, eBay, Splunk, and Slack concisely and with a common set of scalable abstractions?

It took me 5.5 years of excruciatingly difficult exploration to answer that question. It constantly felt like two steps forward, one step back. I would hit technical roadblocks and be stuck on details for months at a time. Completely unexpectedly, I came to suspect a new programming paradigm was needed to be able to form these abstractions. I resisted this idea for months until the evidence was so overwhelming I spent over a year making a new programming language to serve as the foundation for further exploration. That breakthrough led to more roadblocks, whose eventual resolutions led to even more roadblocks beyond. I rewrote code so many times my head spins from thinking about it.

And then in late 2018, I reached my goal. My idea was not only viable, it was more thrilling than anything I could have imagined. I understood what the abstractions were, how they worked, and how they composed together.

Since then, I raised a large seed round and built a fantastic team to pursue this vision with me. We started with my proof-of-concept and have been working since to turn it into a production-worthy development platform.

I cannot wait for the day when our platform is available to the world. It gives me those same chills I get when I think about Neil Armstrong and Buzz Aldrin descending to the lunar surface. This is what has kept me going for the past 8.5 years, and I'm immensely grateful just to have had the opportunity.

My team at Red Planet Labs inspires me every day, and I'm always looking for more great people to join us. If you're a great engineer who shares my passion for expanding human potential, please get in touch!

 

Wednesday
Mar012017

Clojure's missing piece

After two years of being open source and a lot of development, Specter has reached 1.0. In this post I'll explain why I call Specter "Clojure's missing piece" and why it should be used in most Clojure and ClojureScript programs.

Clojure's glaring hole

I've been using Clojure as my primary language for almost 7 years. With it I created Apache Storm which became a large, important, international project in the Big Data world. I also used Clojure to do a successful startup called BackType that was acquired by Twitter. I consider Clojure to be by far the best general-purpose language and have long evangelized it.

As much as I enjoyed programming in Clojure, I came to realize that beyond basic use cases Clojure fails to provide an elegant API for immutable programming. This deficiency in Clojure manifested itself in my projects as unreadable code, complexity, lots of bugs, and bad performance.

To demonstrate this deficiency, I'll start with a few examples where Clojure does immutable programming well:

(assoc {:a 1 :b 2} :a 3) ; => {:a 3, :b 2}
(dissoc {:a 1 :b 2} :a) ; => {:b 2}
(conj [1 2 3] 4) ; => [1 2 3 4]
(pop [1 2 3]) ; => [1 2]
(conj #{1 2 3} 4) ; => #{1 2 3 4}

This is how immutable programming should be. You perform an operation on your data and get back a new data structure with your changes represented. The details of how immutability is implemented, like structural sharing, copying, and creating new instances of the same type, are encapsulated.

Unfortunately, you only get this elegance when doing basic operations on basic data structures. The elegance evaporates for a lot of common operations, especially those involving compound data structures. And unfortunately, your problem domain frequently dictates that you must use compound data structures. In my own work, I use a lot of directed acyclic graphs. Nodes can be a variety of record types, with fields ranging the whole gamut of data structures. Some nodes can have embedded within them another DAG. Manipulating a data structure like this with vanilla Clojure was a non-starter and I was forced to come up with Specter to handle it.

As an example of the lack of elegance of vanilla Clojure, consider the task of namespace qualifying the keys of a map. The obvious way to do this with vanilla Clojure is as follows:

(defn ns-qualify-keys [m]
  (into {}
    (map (fn [[k v]] [(keyword (str *ns*) (name k)) v]))
    m
    ))

This function has two problems: it's slow, and it's wrong. Performance-wise, this function runs upwards of 3x as slow as an optimal implementation. It turns out the best way to iterate over the input map and construct the output map completely changes depending on the type of the map. Writing optimal code for this task requires esoteric knowledge of the internals of Clojure1. For something as fundamental as manipulating program data, the idiomatic way should also be the most performant way. This point alone is enough to warrant a new approach to immutable programming.

However, there's an even more important point: this code is wrong. If you give this function a sorted map, it will return an unsorted map as the result. That's a totally unintended effect to what should just be a transformation on each map key. Whereas the goal of the function is to change some subvalues of the map (the map keys), to write the function correctly you have to reconstruct everything around those subvalues as well. This is a bunch of extra "stuff" you have to do and it's easy to make errors in the reconstruction process. So while you can fix the function by replacing {} with (empty m), the very fact you have to do so is the problem.

It's not just the map you have to reconstruct in this function. It's also the keywords. The only thing you want to modify is the namespace of each keyword, but to do so you have to construct a whole new keyword with everything else exactly the same. Hence the need to call name and keyword in the anonymous function. Reconstructing the map and keywords is intertwined with the desire to change the namespaces. This is the very definition of complexity as defined by Rich Hickey in his great talk Simple Made Easy. I can't even begin to describe how often I wanted to tear my hair out debugging issues that were ultimately caused by a stupid error during reconstruction, like using map instead of mapv.

The way to improve this code is to factor out reusable functions map-keys and update-namespace which each concern themselves with doing their one task well. You can then compose them to implement the original function with proper separation of concerns (and optimal performance):

(defn ns-qualify-keys [m]
  (map-keys
    (fn [k]
      (update-namespace k (fn [_] (str *ns*)))
      )
    m
    ))

This code is significantly improved. Gone is the need to reconstruct the map and keywords – the code concerns itself only with navigating to the namespaces and changing them. That prior intertwinement between updating the namespaces and reconstructing the surrounding data structure has been teased out.

Unfortunately, this isn't quite enough. While having an expanded set of transformation functions like map-keys and update-namespace is the right start, composing them together using nested anonymous functions just doesn't work well. Code quickly becomes unreadable and hard to maintain. Even this code is hard to read, and it gets far worse with more levels of nesting. Consider this code which increments the even numbers in a map of vector of maps (with a hypothetical map-vals function):

(map-vals
  (fn [v]
    (mapv
      (fn [m2]
        (map-vals
          (fn [n] (if (even? n) (inc n) n))
          m2
          ))
      v
      ))
  m)

This is totally unreadable. What's needed are the transformation functions and an elegant way to compose them together. Clojure lacks both of these things.

Here's a non-exhaustive list of basic data structure transformations missing from Clojure:

- Transform each key of a map
- Transform each value of a map
- Transform each value of a generic sequence
- Append to a generic sequence
- Prepend to a generic sequence
- Add anywhere into a generic sequence (e.g. [1 2 4 5] => [1 2 3 4 5])
- Remove anywhere from a generic sequence (e.g. '(1 :a 2) => '(1 2))
- Update the name or namespace of a keyword or symbol

(Some of these cannot be implemented "efficiently", like prepending to a vector in O(1) time. I'm guessing this is why Clojure core does not provide the functionality. However, in reality the need to do these "inefficient" operations sometimes comes up and the efficiency aspect doesn't matter because it's infrequent or not a bottleneck.)

Some of these are quite difficult to implement optimally. And again, implementing these is not enough. What's also needed is an elegant way to compose them so that a nested data structure is just as easy (and readable) to handle as a top-level data structure.

Navigation, the missing abstraction

Specter provides an abstraction called "navigation" which completely solves the problem of doing elegant, high-performance immutable transformations on arbitrarily complicated data structures with all the details (e.g. reconstruction) encapsulated. Here's how to namespace qualify the keys of a map with Specter:

(setval [MAP-KEYS NAMESPACE] (str *ns*) any-map)

This code only changes the keys of the maps – the type of the map remains identical to what it was before. Additionally, the performance is near-optimal regardless of the type of the map (that is, it's extremely difficult to write faster code).

setval is like assoc-in on steroids. It's given a path, a value, and a piece of data to transform. The path tells it what subvalues to set to the new value. You should think of the path as navigating into the data structure, step by step. In this case, it navigates to each map key, and then for each map key navigates to its namespace. The navigators encapsulate the details of iteration and any necessary reconstruction during the transformation.

setval is a thin wrapper over the more general operation transform. transform takes in a function to apply to each navigated subvalue. For example:

(transform [ALL :a even?] inc [{:a 1} {:a 2 :b 1} {:a 4}])
;; => [{:a 1} {:a 3, :b 1} {:a 5}]

This transformation navigates to each element of the vector, then to the value of :a for every map, stays navigated at a value only if it's even, and then applies the inc function.

Specter has two components. The first is the core of Specter which defines the navigator abstraction (defnav) and implements a high-performance method of composing navigators together. You can read about how that works here. Implementing this method was the most difficult code I've ever written, and the fact it's even possible is a huge testament to the power of Clojure.

The second component of Specter is a comprehensive set of navigators for Clojure's core data structures. These include implementations of all of Clojure's aforementioned missing transformation functions, as well as many other kinds of navigations. These navigators encapsulate the most efficient means possible of performing traversal and reconstruction, often utilizing esoteric knowledge of Clojure's internals. It's important to note that navigators (like ALL, MAP-KEYS, NAMESPACE) are first-class objects and not syntax, a common point of confusion to those new to Specter.

As discussed before, these are the two missing pieces of Clojure. By filling these holes, Specter enables all immutable transformations to be done elegantly and with near-optimal performance.

Initially I developed Specter for the kinds of examples already shown: transforming subvalues in a compound data structure. Then I discovered a new category of navigators which shocked me in their expressiveness: substructure navigators. They give an extraordinary amount of control over the manipulation of data and made me believe that manipulating data via the composition of navigators should be a fundamental skill for all functional programmers. I'll go through a few examples of substructure navigators.

srange navigates you to a subsequence of a list or vector. That subsequence is then replaced by whatever sequence it's transformed to. Among other tasks, this navigator can be used to splice new elements into a sequence or remove elements. For example:

(setval [:a (srange 2 4)] [] {:a [1 2 3 4 5]})
;; => {:a [1 2 5]}

(setval (srange 2 2) [:A :A] '(1 2 3 4 5))
;; => (1 2 :A :A 3 4 5)

(transform [:a (srange 1 5)] reverse {:a [1 2 3 4 5 6]})
;; => {:a [1 5 4 3 2 6]}

filterer navigates you to a sequence of all elements matching the predicate. Transformations on that sequence are represented back in the original sequence. For example:

(transform (filterer even?) reverse [1 2 3 4 5 6 7 8 9])
;; => [1 8 3 6 5 4 7 2 9]

subselect navigates you to the sequence of elements matched by the given path. Like filterer, transformations on that sequence are represented back in the original locations for each value. For example:

(transform (subselect ALL :a even?)
  reverse
  [{:a 1} {:a 2 :b 1} {:a 4} {:a 5} {:a 6} {:a 8}])
;; => [{:a 1} {:a 8 :b 1} {:a 6} {:a 5} {:a 4} {:a 2}]]

Combining substructure navigators with other navigators enables some truly beautiful code. Take a look at how to increment the last odd number in a sequence:

(transform [(filterer odd?) LAST] inc [1 2 3 4 5 6])
;; => [1 2 3 4 6 6]

This is not an operation on a compound data structure, yet navigation proves to be a beautiful abstraction for this task, far more elegant than anything you could write in vanilla Clojure. I cannot emphasize this point enough. Although compound data structures are the most glaring use cases in need of Specter, Specter proves to be incredibly useful for the manipulation of non-compound data structures. Its nature as a composable abstraction lets you concisely express very sophisticated transformations. Every newly defined navigator increases the range of use cases combinatorially.

Querying, another natural use case for navigation

The navigation abstraction has been shown for performing transformations, but the abstraction also lends itself naturally for querying. Here's one example:

(select [ALL :a even?] [{:a 1} {:a 2 :b 1} {:a 4}])
;; => [2 4]

select always returns a vector of results. To select just a single value, use select-any:

(select-any [:a :b :c] {:a {:b {:c 1}}})
;; => 1

This code looks effectively identical to the equivalent get-in usage, except it runs 30% faster. All navigators can be used for querying, as the mental model of navigation is identical for both querying and transformation.

Navigation is generic and extensible

As explained before, the core of Specter is the defnav operator upon which Specter's entire collection of navigators is built. Specter's extensibility is one of its most important features, as you may use data structures other than the ones Clojure provides. As I mentioned, I do a lot of work with graphs, and I just wouldn't be able to work with them in any reasonable way without my internal collection of graph navigators (e.g. subgraph, topological traversal, to a node id, to outgoing nodes, to incoming nodes, etc.).

Specter's navigator interface is completely generic, able to express any navigator. Let's see how it works by looking at the implementation of a simple navigator, NAMESPACE:

(defnav ^{:doc "Navigates to the namespace portion of the keyword or symbol"}
  NAMESPACE
  []
  (select* [this structure next-fn]
    (next-fn (namespace structure)))
  (transform* [this structure next-fn]
    (let [name (name structure)
          new-ns (next-fn (namespace structure))]
      (cond (keyword? structure) (keyword new-ns name)
            (symbol? structure) (symbol new-ns name)
            :else (i/throw-illegal "NAMESPACE can only be used on symbols or keywords - " structure)
            ))))

There are two codepaths, one for querying (select*) and one for transforming (transform*). Querying in Specter works very similar to how transducers work. It achieves great performance by avoiding the creation of any intermediate data structures. In this case, it navigates to the namespace of the value by calling next-fn on it. A navigator that navigates to multiple subvalues (like ALL) must call next-fn on each subvalue. (As an aside, select* used to work exactly like the list monad. However, materializing intermediate lists during navigation killed performance so the method was changed in 0.12.0 to avoid that problem.)

transform* is expected to transform any subvalues using next-fn and perform any necessary operations to reconstruct the surrounding data structure. In this case, it transforms the namespace using next-fn and creates a new keyword or symbol depending on the type of the original value.

As you can see, the way navigators are defined is completely generic. Each codepath is a function free to do whatever it needs. It's a simple but very effective interface. Since every navigator composes with all the other navigators, the possibilities are endless. I recommend browsing the source to see how the built-in navigators are defined.

Besides defining a navigator using defnav, you can also define one by combining other navigators together. This is often useful when working with a specific compound data structure for your particular application. For example, suppose you're working on a Blackjack game and storing game information like this:

{:players [{:name "Hopper", :funds 100, :games-played 10}
           {:name "Eleven", :funds 6941, :games-played 7}
           {:name "Will", :funds -12, :games-played -8}]
 :bank {:funds 9850}}

Players are indexed by their id (e.g. index 0, index 1) but not by their name. If you wanted to easily manipulate a player by their name, you could define a navigator for it:

(defn player-by-name [name]
  (path :players
        ALL
        #(= (:name %) name)))


;; Take $1 from Hopper
(transform [(player-by-name "Hopper") :funds] dec data)

;; Retrieve :games-played for Will
(select-any [(player-by-name "Will") :games-played] data)

You could imagine defining a similar navigator player-by-id as well. Abstracting out domain-specific navigators like this significantly reduces the cognitive load of working with your data, as you can write your code in terms of application-specific concepts instead of lower level data structure concepts.

Finally, it's worth mentioning one last concept for building navigators called "dynamic navs". These are kind of like macros, but for navigators. Dynamic navs are too much to explain for this post, but you can read about them here. Also, the source code has many examples of them.

Recursive navigation

Unsurprisingly, Specter is fantastic at working with recursive data structures. For a simple example of this, suppose you had a tree represented using vectors:

(def tree [1 [2 [[3]] 4] [[5] 6] [7] 8 [[9]]])

You can define a navigator to all the leaves of the tree like this:

(def TREE-VALUES
  (recursive-path [] p
    (if-path vector?
      [ALL p]
      STAY)))

recursive-path allows your path definition to refer to itself, in this case using p. This path definition leverages the if-path navigator which uses the predicate to determine which path to continue on. If currently navigated to a vector, it recurses navigation at all elements of the vector. Otherwise, it uses STAY to stop traversing and finish navigation at the current point. The effect of all this is a depth-first traversal to each leaf node.

You can then compose TREE-VALUES with other navigators to do all sorts of manipulations:

;; Increment even leaves
(transform [TREE-VALUES even?] inc tree)
;; => [1 [3 [[3]] 5] [[5] 7] [7] 9 [[9]]]

;; Get odd leaves
(select [TREE-VALUES odd?] tree)
;; => [1 3 5 7 9]

;; Reverse order of even leaves (order based on depth-first search)
(transform (subselect TREE-VALUES even?) reverse tree)
;; => [1 [8 [[3]] 6] [[5] 4] [7] 2 [[9]]]

That you can define how to get to the values you care about once and easily reuse that logic for both querying and transformation is invaluable. And, as always, the performance is near-optimal for both querying and transformation.

New in Specter 1.0

Specter 1.0 is a really big milestone for the project. My goal with Specter had always been to hit that sweet spot of:

- Provides a set of navigators for Clojure's core data structures that feels complete
- Can be used completely dynamically (e.g. pass paths as arguments, store and use paths later, etc.)
- Near-optimal performance for all use cases

It took a few years to figure out how to achieve all this, and after a lot of work Specter is finally there.

The most important part of the Specter 1.0 release is a commitment to backwards compatibility. The 0.11.0, 0.12.0, and 0.13.0 releases had major breaking changes, all of which were required to get Specter to that sweet spot. Now that Specter is there, it will be very stable moving forward.

Specter 1.0 also has some significant new features filling in gaps in the library. For starters, it now can remove elements from maps and sequences with high performance:

;; Remove nil elements form a vector
(setval [ALL nil?] NONE [1 2 nil 3 nil])
;; => [1 2 3]

;; Remove even values from a map
(setval [MAP-VALS even?] NONE {:a 1 :b 2 :c 3})
;; => {:a 1, :c 3}

;; Remove a key from a nested map
(setval [:a :b :c] NONE {:a {:b {:c 1}}})
;; => {:a {:b {}}}

Specter 1.0 extends many of the core navigators to operate on strings:

;; Append to a string
(setval [:a END] "!!!" {:a "Hi"})
{:a "Hi!!!"}

;; Remove middle of a string
(setval (srange 1 3) "" "abcd")
;; => "ad"

Another major new feature is integration with Clojure's transducers via the new traverse-all operation. A lot of common transducer use cases can be expressed more elegantly with traverse-all:

;; Using Vanilla Clojure
(transduce
  (comp (map :a) (mapcat identity) (filter odd?))
  +
  [{:a [1 2]} {:a [3]} {:a [4 5]}])
;; => 9

;; The same logic expressed with Specter
(transduce
  (traverse-all [:a ALL odd?])
  +
  [{:a [1 2]} {:a [3]} {:a [4 5]}])

Specter 1.0 also adds some new navigators and has improved performance for a number of use cases. See the full release notes here.

Conclusion

I think of Specter differently than other libraries. Whereas most libraries provide functionality for a particular task, like interacting with a database, what Specter provides is fundamental to functional programming. Manipulating data structures is one of the most core things you do as a programmer, so even improving that a little bit would have a big effect on your programs. And I would argue that Specter improves that a lot, letting you work at a significantly higher level of abstraction while simultaneously improving performance a great deal.

If you have a Haskell background, I'm sure you're screaming to yourself "Lenses! Lenses!" I actually didn't know about lenses before I made Specter, but they are certainly very similar. I'm not on expert on Haskell, but what I do know is it explicitly distinguishes between navigators that go to one element (Lens) vs. zero or more (Traversal). I fail to see how that complication adds any sort of expressive or performance benefit, but perhaps a Haskeller out there can educate me.

And to answer one of the most common questions I get: no, Specter is not going to become part of Clojure core anytime soon. I'm open to the idea and think it would be a major improvement to Clojure, but the core team has no interest whatsoever. You can, however, achieve the same effect by adding Specter as a dependency and writing (use 'com.rpl.specter).

Leveraging Specter to its maximum potential requires you to change how you think about manipulating data. I've noticed many people struggle to grasp all the ways Specter can be applied, no doubt an instance of the blub paradox. Substructure navigation, recursive paths, and higher-order navigators are a few of the more advanced concepts that can seem overwhelming. I recommend starting with the most obvious use case, transformations of compound data structures, and letting your brain slowly adapt to the navigation way of thinking. That's the way I started, and the library slowly evolved from there as I saw the different ways these ideas could be leveraged. Eventually it will become part of your basic programming instinct, and you'll wonder how you ever lived without it.

1 For a small map (PersistentArrayMap), the optimal transformation uses the IMapIterable interface to iterate over the keys and values. The underlying array is created manually and the PersistentArrayMap/createAsIfByAssoc static method is used to create the new map. For a larger map (PersistentHashMap), reduce-kv for iteration plus transients for construction prove to be the most efficient method.

Monday
Jan092017

Recipe for a great programmer

Ingredients

  • Books for a computer science curriculum from a top university
  • Computer
  • Headphones
  • Internet
  • Stress ball
  • Pillow
  • Lighter fluid
  • Food

Directions

  1. Cover computer science books with lighter fluid
  2. Light books on fire
  3. Use flames to cook an energy-rich meal for the thousands of hours ahead
  4. Pick an editor
  5. Choose a project beyond current capabilities. Good ways to push boundaries:
    • Unfamiliar domain (e.g. large scale data processing, UI programming, high performance computing, games)
    • Exotic programming language
    • Larger in scope than any project before
  6. Shut up about your editor
  7. Attempt to build
  8. Stop procrastinating on Hacker News
  9. Re-attempt to build
  10. Squeeze stress ball and scream into pillow as necessary to keep sanity
  11. When stuck:
    • Paste stack traces into Google
    • Find appropriate mailing list to get guidance
    • Realize that real learning happens when you are stuck, uncomfortable, and/or frustrated
    • Seek out books, classes, or other resources AFTER you have a good understanding of your deficiencies
  12. Repeat #4 to #10 for at least 10 years

Results guaranteed! (to the same extent static types guarantee bug-free programs)

Wednesday
Jul132016

How becoming a pilot made me a better programmer

Two years ago I trained for my private pilot license over the course of nine months. Besides being fun and exhilarating, flying requires a lot of knowledge and skill: aerodynamics, crosswind landings, stall recovery, navigation, recognizing and dealing with system failures, and so on.

Something that always amazes me whenever I explore a completely different body of knowledge is the bleeding effect it has to other subjects I care about. There are fundamental principles that cross domains, and seeing those principles in completely different contexts gives you a much deeper understanding of them. Aviation has been no exception – it taught me a lot about being a better programmer. I wouldn't say it taught me anything completely new, but something about risking a horrifying death in the future by not understanding these principles drove the points home in a way that just programming did not.

With my instructor Chuck Hellweg after my first solo

Do not treat systems as black boxes

I'll never forget the day I did spin training. In a spin (here's a video of one), the plane basically turns into a big rock in the sky, tumbling and rotating while plummeting towards the ground. I've heard it called a "rollercoaster without rails". My instructor got a big kick out of putting me through my first spin – afterwards he said, "I've never seen anyone curse like that before!"

It's easy to think of a plane as a "black box", where the yoke and rudder turn the plane while the throttle adds energy for airspeed or for climbing. In normal stable flight, that's basically what the inputs do. In a spin however, thinking of the plane's inputs that way will get you killed. Adding power will aggravate the spin and pitching up will prevent you from exiting the spin. Trying to roll right could roll you left instead, potentially violently. To exit a spin you must operate the plane differently: put the power to idle, use only the rudder to stop rotation, and pitch down into a near vertical dive. Then the plane is flying again and you can carefully recover from the dive.

To become a pilot you could try to memorize an arbitrary set of rules for how to fly the plane in various scenarios: in cruising flight do this, when landing do that, in a spin do this other thing. But then nothing is intuitive, and you're going to make a lot of mistakes. This is especially true when things go wrong: if you have engine problems in flight, knowing how the engine, carburetor, and fuel system work could mean the difference between a safe landing on a runway while your passengers take selfies versus a dangerous landing on a highway while your passengers desperately send texts to loved ones. You use your tools much more effectively and safely when you understand their implementation.

In software I don't think anything is treated more like a black box than the database. Programmers want to store and retrieve data using the database interface and then leave it to the ops guys to get it running robustly. However, a database's interface only tells a small part of what it means to use that database, and there's a lot of crucial information it doesn't tell you. Which queries will run efficiently? Which queries will consume large amounts of resources? What happens if there are hardware failures? If one application has a bug that triggers an accidental denial-of-service attack on the database, what will happen to other applications? Will just the one application fail, or will it trigger a larger cascading failure? If the database is distributed, how is data partitioned among nodes? What happens when there is data skew?

Understanding the answers to these questions is critical to architecting robust systems, and you need to know how the database works to answer them effectively. When you understand the mechanisms underlying the databases you use, you understand their limits and failure modes much better, and you can use this knowledge to architect better systems.

For example, one of my big motivations in developing the Lambda Architecture was to avoid the failure modes and complexity of the standard architecture of an application incrementally reading and writing into a distributed database. One example of this complexity is managing online compaction – if not done extremely carefully, cascading failure and massive outages will result (as many companies learn the hard way). Another example is eventual consistency – achieving it in a fully incremental architecture is incredibly prone to error, and the smallest mistake will lead to data corruption (the infrequent kind that's very hard to track down). I strongly believe the best way to avoid failures is to design architectures so those failure modes don't exist in the first place. The more that can go wrong, the more that will go wrong.

I recommend reading through Kyle Kingsbury's Jepsen posts to see that relationship between database robustness and complexity firsthand. A common theme throughout those posts is how often distributed databases contradict their marketing, like by silently losing massive amounts of data during partitions. For me, knowing how something works gives me confidence it will work properly, and if the implementation is too complex or confusing I proceed extremely cautiously.

There's an even bigger reason to understand how your tools work. Too many programmers treat their tools like LEGO DUPLO pieces and limit solutions to how those tools can be used and combined. But when you understand how your tools work, then you can reason in terms of what would be the ideal solution if ideal tools existed. Software becomes clay instead of LEGOs. Something I've found repeatedly in my career is the languages, databases, and other tools we use are so far removed from ideal that the opportunities for fundamental innovation are constant. Thinking this way led me directly to creating Storm, Cascalog, and ElephantDB, all of which gave myself and my team huge leverage. Even when constructing the ideal tools yourself isn't practical, knowing what would be ideal is hugely helpful when evaluating possible solutions.

I'm a big fan of Martin Thompson's term "mechanical sympathy". The kind of high performance work he's known for is absolutely impossible without a deep understanding of implementation. And performance optimization in general is much easier the more you understand the implementation of your dependencies.

If you read my last post about the limited value of computer science educations, you might think I'm contradicting myself by arguing for understanding the internals of your dependencies. After all, that's a major focus of a computer science education. My answer is I advocate for a computer science education for programmers to the same extent I advocate for an aeronautical engineering degree for pilots. The degrees are relevant, useful, and helpful, but on their own do little to make you a good programmer or a good pilot. You can learn how your tools work, oftentimes much more effectively, in the trenches as opposed to on a whiteboard. Where abstract education pays its dividends is when you push the boundaries of your field. The deep problem solving and algorithm skills I gained from my computer science education helped me greatly in that regard, and if I were a test pilot I imagine a formal aeronautical engineering education would be essential.

Correctly monitoring production systems

Aviation also helped me gain a deeper understanding of monitoring production systems. I used to think of monitoring as an afterthought, something to be added after the actual system was built. Now I consider monitoring a core piece of any system and take it into account from the start.

In aviation, you monitor the plane using a variety of instruments. The primary instruments measure altitude, airspeed, direction, and orientation using gyros and air pressure. You also have a navigation tool called VOR that can determine your direction from a fixed radio beacon (but not your distance). That you can fly safely with zero visibility solely on these primitive instruments is amazing.

Last year I did some special training in a simulator where I told the instructor to try his best to kill me. He enthusiastically put me in a variety of unusual situations that involved bad weather as well as system failures. When an engine fails on a plane it's fairly obvious, but when an instrument fails sometimes the only way to tell is by looking at the other instruments (especially when you're inside a cloud and have no outside references to use). But when the instruments are inconsistent with each other, how do you know which instrument is broken? Deciding wrong could lead to further bad decisions – like flying the plane straight into the ground while thinking you're cruising at 3500'.

In one scenario he clogged my static source port while I was climbing inside a cloud. When I saw my altimeter was frozen, I initially thought I had let the plane pitch down a bit, stopping the climb. But when I crosschecked my instruments, I saw the other instruments did not support this hypothesis. If I had acted on that first instinct by pitching up, I could have stalled the plane and put myself in a very bad (simulated) situation. Instead, I correctly diagnosed the problem and switched to the alternate static port to bring the altimeter back to life.

In another scenario, he killed my vacuum pump, causing my directional gyro to fail. I then had to navigate using only the magnetic compass. Because I understood the magnetic compass is inaccurate during acceleration, I did not overreact to the strange readings it gave me during turns. If I had remembered how the readings deviate during turns depending on which direction you're facing (which I will internalize when I do my instrument rating), I would have been able to do those turns even more precisely.

There are two lessons here. First, I was able to read the instruments correctly because of my understanding of how they compute their measurements. Second, the redundancy between the instruments allowed me to diagnose and debug any issues. The same lessons apply to software.

Deploying software to production is like flying an airplane through a cloud. You know the airplane is well-designed and well-tested, but whether it's operating properly at this particular moment can only be determined from your instruments.

Consider a basic property of a production software system: it does not lose or corrupt data. You may have thorough tests, but unexpected things happen in production. How do you know this crucial property is being maintained once the software is deployed? When I started my first job out of college, some of the code for the company's product astounded me – and not in a good way. The code consisted of a plethora of special cases to handle the various ways the database had been corrupted over time.

One of my big "aha" moments as a programmer has been embracing the philosophy of "your code is wrong!". Even if you have a fantastic test suite, bugs can and do make it to production. The implications of this are huge. When you acknowledge bugs are inevitable, then bugs that corrupt your data are inevitable as well. When one of these bugs strikes in production, you need to know as soon as possible to prevent even more corruption. Maybe the bug will trigger an exception somewhere, but you can't rely on that. Your only option is to instrument your systems. If you don't then your customers will be your data corruption instruments – and that's not good for anyone.

Here's an example of a data loss bug I had that my thorough tests failed to catch. I had a Hadoop job that would "shuffle" data to the reducers by generating random numbers for keys. It turns out that for Hadoop to be completely fault-tolerant, tasks must output the same results each time they are run. Since my code was generating keys randomly on each run, a reduce task randomly failing would cause some data loss. Needless to say, discovering this as the cause of our data loss was one of those hair-pulling "why did I ever become a programmer?" debugging sessions.

There's a lot you can do to monitor for data corruption. One great technique is to continuously generate "dummy data" and push it through your production systems. Then you check that every stage of your processing outputs exactly the expected results. The end result is a simple instrument that either says "corrupt" or "non corrupt" – with the "corrupt" case showing the difference between the expected and actual results.

Just because your "dummy data" instrument says "not corrupt" does not mean data corruption is not happening. By necessity such a technique funnels a relatively small scale of data through your systems, so if you have data corruption errors that occur infrequently you probably won't detect them. To properly understand what the "not corrupt" indicator is actually saying requires an understanding of how the "dummy data" instrument works.

Another great way to detect data corruption is through aggregates like counts. For example, in a batch processing pipeline with many stages, you oftentimes know exactly how many records are expected in the output of a stage given the number of input records. Sometimes the relationship isn't static but there's a cheap way to augment the processing code to determine how many output records there should be. Using this technique would have caught the random number bug I described before.

A lot of data processing pipelines have a normalization stage, such as converting free-form location fields into structured locations (e.g. "SF" becomes "San Francisco, CA, USA"). A useful measurement is the percentage of data records that fail to normalize. If that percentage changes significantly in a short time frame, that means either 1) there's a problem with the normalizer, 2) there's a problem with the data coming in, 3) user behavior has suddenly changed, or 4) your instrument is broken.

Obviously none of these instruments are perfect. But the more checks you have, the harder it is for a corruption-causing bug to go unnoticed. Oftentimes your measurements partially overlap - this redundancy is a good thing and helps verify the instruments are functioning correctly. An example of this is measuring the overall latency of a pipeline as well as measuring the latencies of the individual components. If the latencies don't add up you either have a bug or are missing something.

Gil Tene has a great presentation on the incredibly common mistakes people make when measuring the latency of their systems. I highly recommend checking that out, and it's a great example of needing to understand how your measurements work to properly interpret them.

When I construct software now, I don't consider any part of what I build to be robust unless I can verify it with measurements in production. I've seen way too many crazy things happen in production to believe testing is sufficient to making software robust. Monitoring is critical to finding how the expectation of production behavior differs from the reality. Making the expected properties of software measurable is not always easy, and it often has major effects on software design. For this reason I account for it in all stages of the development process.

Conclusion

I've discussed some of the biggest areas aviation has improved me as a programmer: going deeper into the tools I use and thinking of instrumentation in a more holistic way. There's an even broader point here though. Becoming a pilot seems unrelated to being a programmer, yet there are surprising overlaps between the two fields. I've found every time I've pursued subjects completely unrelated to my career, there are areas of overlap that improve me in unexpected ways.

This can be captured as a general philosophy of life: consistently find ways to get outside your comfort zone and do things that challenge and frustrate you. Besides learning new things, you'll become better at everything you do because of the surprising overlaps of knowledge. The most extreme example of this for me happened three years ago when I did standup comedy for three months straight. The lessons I learned from standup have been invaluable in improving my skill as a technical speaker. And surprisingly, it's helped me become a better writer as well.

One last thing – don't let my casual references to death throughout this post scare you away from experiencing the joy of small plane flight. The only time most people hear about small planes is when they crash, so the common fear of them is just plain selection bias. If you have a few hours and an extra couple hundred bucks to spare, I highly recommend doing an intro flight and experiencing the thrill of being a pilot.

Monday
Jun202016

The limited value of a computer science education

I started programming at the age of 10 on my TI-82 graphing calculator. I loved video games as a kid and was ecstatic to learn I could make them myself – while also distracting myself from the boredom of math class. Very rapidly I came to love the craft of programming in and of itself and entered into a long and rewarding journey of nurturing my skills as a programmer.

Part of that journey included a bachelor's and master's degree in computer science at Stanford, including a significant amount of time as part of a research group. As a professional I've built many useful projects and contributed significant knowledge to my field. In this post I will explore what value, if any, a computer science education provides for a professional developer.

There are people who take the extreme positions, that a computer science education is totally useless or that a computer science education is completely essential. My position is that a computer science education is overvalued, though not useless. The vast majority of what a programmer does, forming good abstractions and avoiding complexity, is almost completely untouched by computer science curriculums.

A handful of times in my career, I encountered problems which I don't think I would have solved without my computer science education. These all related to algorithms in distributed systems that required formal proofs to be confident of their correctness. These few cases happened to be critical to their respective systems – for example, this algorithm is what made Storm possible. So in that sense, I personally have gotten a lot of benefit from my computer science education.

However, I also have to acknowledge that the scale and difficulty of the problems I've worked on is unusual compared to the vast majority of programmers. Most people aren't designing new distributed processing paradigms. I highly doubt most programmers need to know how to do formal proofs of algorithms, determine asymptotic complexity of algorithms, or even know that much about data structures. Most people are doing the kind of programming that involves piecing together components like databases, application frameworks, deployment tools, monitoring systems, and data processing frameworks into an application.

Of course there are some fields for which having a computer science education is much more relevant on a daily basis – distributed systems is one that comes to mind. But even in that field, the vast majority of your work is forming good abstractions and avoiding complexity – those two crucial skills that you don't learn in a computer science education. The benefits of a computer science education are limited to the value you get out of a small percentage of situations. For me those situations were very high value, but most programmers don't even run into them.

The reason why I even feel the need to write about this topic is because computer science skills are generally all that is tested in programmer interviews, via the infamous whiteboard coding interview. Whether someone can or cannot solve some cute algorithm problem in a high-pressure situation tells you nothing about that person's ability to write solid, clean, well-structured programs in normal working conditions. This practice is akin to hiring pilots based on how they do on an aeronautical engineering exam. That knowledge is relevant and useful but tells you little about the skill that person has for the job.

What's especially bizarre about this practice of recruiting is there exists an alternative that is better in every respect. Take-home projects are a decent measure of programming ability and are far less of a time investment for the company (15 minutes to evaluate project vs. 1 hour for interview). Plus I find that most candidates prefer take-home projects because it's less stressful and gives a better opportunity to show off their skills.

So in that sense, due to irrationality in the software industry a computer science education is useful for doing well in job interviews. But there have been some trends recently indicating that these nonsensical recruiting practices will change – programs such as Recurse Center and Insight Data Engineering help their residents get jobs by using the projects they've built over the course of one to three months as their selling point. So as the industry adapts and computer science knowledge ceases to be so crucial to getting a programmer job, a complete computer science education will become less important. How long that will take to happen is unclear.

In my experience, the only way to become a good programmer is to do it... a lot. Reading books and taking classes won't make you a better programmer – they might provide some guidance, but real improvement will come from the grueling hours you put into real projects. I could read books all day about how to execute a perfect barrel roll, but it will all be meaningless until I put on a parachute and get in the cockpit. Programming is the same way.

Overall I have mixed feelings about the value of a computer science education, mostly because of the personal benefit I have gotten from mine. For most cases though, I think it is severely overvalued. It's very strange to observe an industry with major talent shortages, and then to know perfectly good self-taught programmers get prematurely rejected in interviews because they don't have a computer science background. I hope to see the industry improve in this respect, but in the meantime I'm happy to exploit this imbalance as a competitive advantage.

Monday
Sep282015

Functional-navigational programming in Clojure(Script) with Specter

In February I open-sourced a library called Specter, and in my own work it has become by far my most-used library. It has changed the way I approach some fundamental aspects of programming, namely how I interact with and manipulate my program's data. I call the approach I take now "functional-navigational programming". I'm not the first one to come up with these ideas, nor is it a full-fledged paradigm in the sense of object-oriented or functional programming. But I give it a name because these techniques have changed the way I go about structuring huge amounts of my code. The best part is the abstractions used in this approach are not only concise and elegant – but also have performance rivaling hand-optimized code.

One of Clojure's greatest strengths is its powerful facilities for doing immutable programming: persistent data structures and a standard library that incorporates immutable programming at its core. Where Clojure's standard library gives you difficulty is dealing with composite immutable data structures, like a map of lists of maps. This is incredibly common, and I've run into it over and over in my years of programming Clojure. You're forced to write code that not only finds and manipulates the subvalue you care about, but also reconstructs the rest of the input data structure in the process.

Much more powerful than having getters and setters for individual data structures is having navigators into those data structures. Navigators can be composed arbitrarily, allowing you to concisely manipulate composite data structures of arbitrary sophistication. Let's look at an example to illustrate this difference. Suppose you're writing a program whose state looks something like this:

(def world
  {:people [{:money 129827 :name "Alice Brown"}
            {:money 100 :name "John Smith"}
            {:money 6821212339 :name "Donald Trump"}
            {:money 2870 :name "Charlie Johnson"}
            {:money 8273821 :name "Charlie Rose"}
            ]
   :bank {:funds 4782328748273}}
  )

This data structure contains information about a bank and its list of customers. Notice that customers are indexed by the order in which they joined the bank, not by their names.

Now suppose you want to do a simple transformation that transfers money from a user to the bank. This code is ugly but also typical of Clojure code that deals with composite data structures:

(defn user->bank [world name amt]
  (let [;; First, find out how much money that user has
        ;; to determine whether or not this is a valid transfer
        curr-funds (->> world
                        :people
                        (filter (fn [user] (= (:name user) name)))
                        first
                        :money
                        )]
   (if (< curr-funds amt)
     (throw (IllegalArgumentException. "Not enough funds!"))
     ;; If valid, then need to subtract the transfer amount from the
     ;; user and add the amount to the bank
     (-> world
         (update 
           :people
           (fn [user-list]
             ;; Important to use mapv to maintain the type of the 
             ;; sequence containing the list of users. This code
             ;; modifies the user matching the name and keeps
             ;; every other user in the sequence the same.
             (mapv (fn [user]
                     ;; Notice how nested this code is that manipulates the users          
                     (if (= (:name user) name)
                       (update user :money #(+ % amt))
                       ;; If a user doesn't match the name during the scan,
                       ;; don't modify them
                       user
                       ))
                   user-list)))
         (update-in
           [:bank :funds]
           #(- % amt))
           ))))

There's a lot of problems with this code:

  • Not only does it need to do the appropriate credit and deduction, it also needs to reconstruct the world data structure it traversed on its way to the manipulated values. This logic is spread throughout the function.
  • The code is nested and difficult to read.
  • This function is specific to only one particular kind of transfer. There are many other kinds of transfers you may want to do: bank to a user, bank to many users, users to users, and so on. Each one of these functions would be burdened with the same necessity of navigating and reconstructing the data structure.

A better approach

Of course, there's a far better approach. Let's take a look at a generic transfer function that uses Specter to do a many-to-many transfer of a fixed amount between any two sets of entities. To be clear on the semantics of this function:

  • If the bank, Bob, and Alice transfer $50 to Jim and Sally, then Jim and Sally each receive $150 while the bank, Bob, and Alice each lose $100.
  • If any of the transferring entities lack sufficient funds, an error is thrown.

Here is the implementation:

(defn transfer
  "Note that this function works on *any* world structure. This handles
   arbitrary many to many transfers of a fixed amount without overdrawing anyone"
  [world from-path to-path amt]
  (let [;; Get the sequence of funds for all entities making a transfer
        givers (select from-path world)

        ;; Get the sequence of funds for all entities receiving a transfer
        receivers (select to-path world)

        ;; Compute total amount each receiver will be credited
        total-receive (* amt (count givers))

        ;; Compute total amount each transferrer will be deducted
        total-give (* amt (count receivers))]

    ;; Make sure every transferrer has sufficient funds
    (if (every? #(>= % total-give) givers)
      (->> world
           ;; Deduct from transferrers
           (transform from-path #(- % total-give))
           ;; Credit the receivers
           (transform to-path #(+ % total-receive))
           )
      (throw (IllegalArgumentException. "Not enough funds!"))
      )))

The keys to this code are the "select" and "transform" functions. They utilize the concept of a "path" which identifies elements within a data structure that should be queried or manipulated. Let's hold off for a second on the details of what those paths look like and make some observations about this transfer function:

  • It's extremely generic. It handles fixed many-to-many transfers between any sets of entities.
  • It's easy to read and elegant.
  • Unlike the first example, this code is agnostic to the details of the "world" data structure. This works with any representation of the world.
  • It's very fast. Even though it's so much more generic than the initial user->bank function, it only executes slightly slower for that one particular use case.

This is some of the power of functional-navigational programming. How to get to your data is separated from what you want to do with it. This allows for generic and powerful abstractions like the transfer function.

Of course, the transfer function is only as powerful as the paths that can be passed to it. So let's take a quick detour to explore the concept of a "path" within a data structure. You'll see that they're extremely flexible and allow you to navigate in a very fine-grained way.

Core concepts of Specter

A path is just a list of steps for how to navigate into a data structure. That path can then be used to either query for subvalues or to do a transformation of a data structure. For example, if your data structure is a list of maps, here's code that increments all even values for :a keys:

(transform [ALL :a even?]
            inc
            [{:a 2 :b 3} {:a 1} {:a 4}])
;; => [{:a 3 :b 3} {:a 1} {:a 5}]

First, the "ALL" selector navigates to every map in the sequence. For each map, the ":a" keyword navigates to the value for that key within every map. Then, the "even?" function only stays at values which are even. After the selector is the "transform function" which takes in each value navigated to and returns its replacement value.

To understand how this code works its helpful to walk through how the data flows from step to step. First you start off with the input data structure:

[{:a 2 :b 3} {:a 1} {:a 4}]

ALL navigates to each element of the sequence, continuing the navigation from each element independently:

{:a 2 :b 3}
{:a 1}
{:a 4}

The :a keyword navigates to the value of that keyword for each element, leading to:

2
1
4

Then, the even? function only stays navigated at values which match the filter. This removes 1 from the navigated values, leaving:

2
4

Now Specter has reached the end of navigation, so it applies the update function to every value:

3
5

Now it's time to reconstruct the original data structure with these changes applied. To do this the navigators are traversed in reverse. The even? function brings back any values which it filtered out before:

3
1
5

The :a keyword replaces the values for :a in the original maps with the new values:

{:a 3 :b 3}
{:a 1}
{:a 5}

Finally, the ALL keyword puts everything back together in a sequence of the same type of the original sequence:

[{:a 3 :b 3} {:a 1} {:a 5}]

And that completes this transformation. Let's take a look at another example. This one increments the last odd number in a sequence of numbers:

(transform [(filterer odd?) LAST]
           inc
           [2 1 3 6 7 4 8])
;; => [2 1 3 6 8 4 8]

"(filterer odd?)" navigates to a view of the sequence that only contains the odd numbers. "LAST" navigates to the last element of that sequence. When the data structure is reconstructed, only the last odd number is incremented.

Let's look at the data flow for this transformation as well. The transformation starts with the input data structure:

[2 1 3 6 7 4 8]

The (filterer odd?) navigator filters the sequence for odd numbers. It also remembers to which index in the original sequence each of the filtered numbers came from. This will be used later during reconstruction.

[1 3 7]

The LAST navigator simply takes the last value of the sequence:

7

This is the end of navigation, so the update function is applied:

8

Now Specter works backwards through the navigators to reconstruct the data structure. LAST replaces the last value of its input sequence:

[1 3 8]

(filterer odd?) uses the index map it made before to set the values of its input sequence at the appropriate indices:

[2 1 3 6 8 4 8]

That's how you end up with the final result.

The next example reverses the positions of all the even numbers between indices 4 and 11:

(transform [(srange 4 11) (filterer even?)]
           reverse
           [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15])
;; => [0 1 2 3 10 5 8 7 6 9 4 11 12 13 14 15]

"srange" navigates to the subsequence bound by the two specified indices. The "reverse" function receives all the odd numbers between those two indices, and then Specter reconstructs the original data structure with the appropriate changes. This example is nifty because writing it by hand is actually quite difficult.

Let's take a look at doing queries using Specter. Here's how to get every number divisible by three out of a sequence of sequences:

(select [ALL ALL #(= 0 (mod % 3))]
        [[1 2 3 4] [] [5 3 2 18] [2 4 6] [12]])
;;=> [3 3 18 6 12]

"select" always returns a sequence of results because paths can select many elements. In this case two ALL's are needed because there are two levels of sequences in this data structure.

As you can see, each component of a path specifies one step of a navigation. What's powerful is that these individual steps can be composed together any which way, arbitrarily. This allows you to specify queries and transformations of immense sophistication.

At the core of Specter is a protocol for specifying one step of navigation. It looks like this:

(defprotocol StructurePath
  (select* [this structure next-fn])
  (transform* [this structure next-fn])
  )

Every single selector you've seen so far is defined in terms of this protocol. For example, here's how keywords implement it:

(extend-type clojure.lang.Keyword
  StructurePath
  (select* [kw structure next-fn]
    (next-fn (get structure kw)))
  (transform* [kw structure next-fn]
    (assoc structure kw (next-fn (get structure kw)))
    ))

The protocol has one method for doing selects and another for doing transforms. In the select case, the "next function" finishes the selection from whatever values this step navigates to. In the transform case, the "next function" will transform any value this step navigates to, and the step is responsible for incorporating any transformed subvalues into the original data structure. As you can see from this example, the StructurePath implementation for keywords perfectly captures what it means to navigate within a data structure by a keyword.

Back to the bank example

Now that you've seen how paths work within Specter, I'll demonstrate how flexible this abstraction is with a variety of different kinds of transfers on the original bank example.

Here's how to get every person to pay a $1 fee to the bank:

(defn pay-fee [world]
  (transfer world
            [:people ALL :money]
            [:bank :funds]
            1))

Here's how to have every person receive $1 from the bank. The arguments are simply reversed as you would expect:

(defn bank-give-dollar [world]
  (transfer world
            [:bank :funds]
            [:people ALL :money]
            1))

Here's a function that returns a path to a particular user. It scans through all users and only selects those matching the given name. This function can be used to do transfers involving particular users.

(defn user [name]
  [:people
   ALL
   #(= (:name %) name)])

Later on, you'll see that there's a better way to implement "user" that allows for much better performance. For now, here's a function that transfers between two users:

(defn transfer-users [world from to amt]
  (transfer world
            [(user from) :money]
            [(user to) :money]
            amt))

And here's a function to implement the initial example, transferring money from a user to the bank:

(defn user->bank [world from amt]
  (transfer world
            [(user from) :money]
            [:bank :funds]
            amt))

Finally, here's a function to give a $5000 "loyalty bonus" to the oldest three users of the bank:

(defn bank-loyal-bonus [world]
  (transfer world
            [:bank :funds]
            [:people (srange 0 3) ALL :money]
            5000))

As you can see, Specter can navigate through data structures in a very diverse set of ways. And what you've seen so far is just the tip of the iceberg: see the README to see more of the selectors that come with Specter.

Without Specter, implementing each of these transformations would have been tedious and repetitive – each would have been burdened with precisely reconstructing anything in the input data structure it didn't touch. But by having a few simple navigators and composing them together, each specific transformation can be handled very easily. This is the crux of functional-navigational programming: better a handful of generic navigators than a lot of specific operations.

Achieving high performance with precompilation

Edit: The need to manually precompile paths has been almost completely superseded by the inline factoring and caching feature introduced in Specter 0.11.0. See this post for more details.

Using Specter as shown actually won't get you very good performance – interpreting those paths is quite costly. But the good news is that with a slight amount more effort, you can get performance that's 5-10x better and rivals hand-optimized code.

Most of the cost of running a select or transform is interpreting those paths, especially when the data structure being manipulated is small and the individual navigation operations are cheap. So Specter allows you to precompile your paths to achieve much higher perfomance by stripping away all the overhead. Here's a precompiled version of one of the previous examples:

(def compiled-path (comp-paths ALL :a even?))
(transform compiled-path
           inc
           [{:a 2 :b 3} {:a 1} {:a 4}])

Precompiled paths act just like any other navigator and can be composed with other navigators. If you know for sure that your path is going to be precompiled, you can use the compiled-select and compiled-transform functions to squeeze out even more performance.

Let's take a look at some basic microbenchmarks to see how good Specter's performance is. Here are five different ways to get a value out of a many-nested map. The benchmark function times how long it takes to run its input function that many times.

(def DATA {:a {:b {:c 1}}})
(def compiled-path (comp-paths :a :b :c))

(benchmark 1000000 #(get-in DATA [:a :b :c]))
;; => "Elapsed time: 77.018 msecs"

(benchmark 1000000 #(select [:a :b :c] DATA))
;; => "Elapsed time: 4143.343 msecs"

(benchmark 1000000 #(select compiled-path DATA))
;; => "Elapsed time: 63.183 msecs"

(benchmark 1000000 #(compiled-select compiled-path DATA))
;; => "Elapsed time: 51.964 msecs"

(benchmark 1000000 #(-> DATA :a :b :c vector))
;; => "Elapsed time: 34.235 msecs"

You can see what a huge difference precompilation makes, giving almost a 100x improvement for this particular use case. The fully compiled Specter execution is also more than 30% faster than get-in, one of Clojure's few built-in functions for dealing with nested data structures! Finally, the last example shows how long it takes to run the equivalent selection with direct, inlined code. Specter's not too far off, especially when you consider how high-level of an abstraction it is.

Let's now look at a benchmark for transforms. Here are five different ways to increment the value in that nested map:

(benchmark 1000000 #(update-in DATA [:a :b :c] inc))
;; => "Elapsed time: 1037.94 msecs"

(benchmark 1000000 #(transform [:a :b :c] inc DATA))
;; => "Elapsed time: 4305.429 msecs"

(benchmark 1000000 #(transform compiled-path inc DATA))
;; => "Elapsed time: 184.593 msecs"

(benchmark 1000000 #(compiled-transform compiled-path inc DATA))
;; => "Elapsed time: 169.841 msecs"

(defn manual-transform [data]
  (update data
          :a
          (fn [d1]
            (update d1
                    :b
                    (fn [d2]
                      (update d2 :c inc))))))
(benchmark 1000000 #(manual-transform DATA))
;; => "Elapsed time: 161.945 msecs"

Once again, precompilation brings massive performance improvements. In this case, the comparison against Clojure's built-in equivalent update-in is even more dramatic: Specter is over 5x faster. Even more striking, the last benchmark measures a hand-written implementation, and Specter's performance is extremely close to it.

Precompile anywhere, anytime

Up until a few weeks ago, this was the extent of Specter's story. Specter could precompile paths and achieve great performance if the path was known statically. In the 0.7.0 release though, Specter gained a new capability that allows it to precompile any path at any time, even if the path requires parameters which aren't available yet. This lets you use Specter's very high level of abstraction with great performance in all situations. Since the problem Specter solves is so common, with this new capability I'm now comfortable referring to Specter as Clojure's missing piece.

Let's take a look at compiling paths that don't yet have their parameters. Earlier you saw this example that reverses the position of all even numbers in a subsequence:

(transform [(srange 4 11) (filterer even?)]
           reverse
           [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15])
;; => [0 1 2 3 10 5 8 7 6 9 4 11 12 13 14 15]

Let's say you want a function that encapsulates this behavior but takes in the indices and the filtering predicate as parameters. An attempt without precompilation would look like this:

(defn reverse-matching-in-range [aseq start end predicate]
  (transform [(srange start end) (filterer predicate)]
             reverse
             aseq))

Because there's no precompilation, there's a lot of overhead in running this function. To precompile this without its parameters, you can do this:

(let [compiled-path (comp-paths srange (filterer pred))]
  (defn reverse-matching-in-range [aseq start end predicate]
    (compiled-transform (compiled-path start end predicate)
                    reverse
                    aseq)))

The compiled path takes in parameters equal to the sum of the parameters its path elements require. And since all the precompilation optimizations are applied, this code executes very fast.

We can now come back to the bank example and make an efficient implementation of the user->bank function in terms of Specter. All you have to do is take advantage of the ability to precompile paths without their parameters, like so:

(def user
  (comp-paths :people
              ALL
              (paramsfn [name]
                        [elem]
                        (= name (:name elem)))
              ))

(def user-money (comp-paths user :money))

(def BANK-MONEY (comp-paths :bank :funds))

(defn user->bank [world name amt]
  (transfer world (user-money name) BANK-MONEY amt))

That's all there is to it! Converting uncompiled paths to compiled paths is always a straightforward refactoring.

Conclusion

Specter has very close similarities to prior work, especially lenses in Haskell. I'm not intimately familiar with Haskell lenses, so I'm not sure if they're entirely equivalent. Specter has other features that weren't discussed in this post (discussed in the README) that I'm not sure are in Haskell. Any clarification from Haskell experts out there would be welcome.

The functional-navigational approach leverages the power of composition to produce more concise and declarative code. In my own work I have selectors for navigating graphs in a variety of ways: in topological order, to a subgraph (with the ability to replace the subgraph with a new subgraph, with metadata indicating how to reattach the edges to the surrounding graph), to other nodes via outgoing or incoming edges, and so on. By focusing on making generic navigators, rather than functions for the specific transformations I need, I'm able to define the transformations I need for particular cases via the composition of my generic navigators. Since the graph navigators compose with all the other navigators you've seen, the possibilities are endless (The graph navigators are a little tied to my own datatypes, so I haven't open-sourced them yet. But they are surprisingly easy to implement – only about 150 lines of code. I would love to see someone contribute a specter-graph library).

And that pretty much summarizes the functional-navigational approach. Instead of thinking in terms of specific transformations, you make generic navigators that compose to your specific transformations – plus a heck of a lot more. My major accomplishment with Specter was figuring out how to make this all blazing fast within a dynamic language.

I've loved using Clojure for the majority of my work the past five years, and Specter makes that experience even better. To me Specter really does feel like Clojure's missing piece, and I strongly believe every single Clojure/ClojureScript programmer will benefit from using it.

Monday
Oct062014

History of Apache Storm and lessons learned

Apache Storm recently became a top-level project, marking a huge milestone for the project and for me personally. It's crazy to think that four years ago Storm was nothing more than an idea in my head, and now it's a thriving project with a large community used by a ton of companies. In this post I want to look back at how Storm got to this point and the lessons I learned along the way.

The topics I will cover through Storm's history naturally follow whatever key challenges I had to deal with at those points in time. The first 25% of this post is about how Storm was conceived and initially created, so the main topics covered there are the technical issues I had to figure out to enable the project to exist. The rest of the post is about releasing Storm and establishing it as a widely used project with active user and developer communities. The main topics discussed there are marketing, communication, and community development.

Any successful project requires two things:

  1. It solves a useful problem
  2. You are able to convince a significant number of people that your project is the best solution to their problem

What I think many developers fail to understand is that achieving that second condition is as hard and as interesting as building the project itself. I hope this becomes apparent as you read through Storm's history.

Before Storm

Storm originated out of my work at BackType. At BackType we built analytics products to help businesses understand their impact on social media both historically and in realtime. Before Storm, the realtime portions of our implementation were done using a standard queues and workers approach. For example, we would write the Twitter firehose to a set of queues, and then Python workers would read those tweets and process them. Oftentimes these workers would send messages through another set of queues to another set of workers for further processing.

We were very unsatisfied with this approach. It was brittle – we had to make sure the queues and workers all stayed up – and it was very cumbersome to build apps. Most of the logic we were writing had to do with where to send/receive messages, how to serialize/deserialize messages, and so on. The actual business logic was a small portion of the codebase. Plus, it didn't feel right – the logic for one application would be spread across many workers, all of which were deployed separately. It felt like all that logic should be self-contained in one application.

The first insight

In December of 2010, I had my first big realization. That's when I came up with the idea of a "stream" as a distributed abstraction. Streams would be produced and processed in parallel, but they could be represented in a single program as a single abstraction. That led me to the idea of "spouts" and "bolts" – a spout produces brand new streams, and a bolt takes in streams as input and produces streams as output. They key insight was that spouts and bolts were inherently parallel, similar to how mappers and reducers are inherently parallel in Hadoop. Bolts would simply subscribe to whatever streams they need to process and indicate how the incoming stream should be partitioned to the bolt. Finally, the top-level abstraction I came up with was the "topology", a network of spouts and bolts.

I tested these abstractions against our use cases at BackType and everything fit together very nicely. I especially liked the fact that all the grunt work we were dealing with before – sending/receiving messages, serialization, deployment, etc. would be automated by these new abstractions.

Before embarking on building Storm, I wanted to validate my ideas against a wider set of use cases. So I sent out this tweet:

A bunch of people responded and we emailed back and forth with each other. It became clear that my abstractions were very, very sound.

I then embarked on designing Storm. I quickly hit a roadblock when trying to figure out how to pass messages between spouts and bolts. My initial thoughts were that I would mimic the queues and workers approach we were doing before and use a message broker like RabbitMQ to pass the intermediate messages. I actually spent a bunch of time diving into RabbitMQ to see how it be used for this purpose and what that would imply operationally. However, the whole idea of using message brokers for intermediate messages didn't feel right and I decided to sit on Storm until I could better think things through.

The second insight

The reason I thought I needed those intermediate brokers was to provide guarantees on the processing of data. If a bolt failed to process a message, it could replay it from whatever broker it got the message from. However, a lot of things bothered me about intermediate message brokers:

  1. They were a huge, complex piece of the architecture that would have to be scaled alongside Storm.
  2. They create uncomfortable situations, such as what to do when a topology is redeployed. There might still be intermediate messages on the brokers that are no longer compatible with the new version of the topology. So those messages would have to be cleaned up/ignored somehow.
  3. They make fault-tolerance harder. I would have to figure out what to do not just when Storm workers went down, but also when individual brokers went down.
  4. They're slow. Instead of sending messages directly between spouts and bolts, the messages go through a 3rd party, and not only that, the messages need to be persisted to disk.

I had an instinct that there should be a way to get that message processing guarantee without using intermediate message brokers. So I spent a lot of time pondering how to get that guarantee with spouts and bolts passing messages directly to one another. Without intermediate message persistence, it was implied that retries would have to come from the source (the spout). The tricky thing was that the failure of processing could happen anywhere downstream from the spout, on a completely different server, and this would have to be detected with perfect accuracy.

After a few weeks of thinking about the problem I finally had my flash of insight. I developed an algorithm based on random numbers and xors that would only require about 20 bytes to track each spout tuple, regardless of how much processing was triggered downstream. It's easily one of the best algorithms I ever developed and one of the few times in my career I can say I would not have come up with it without a good computer science education.

Once I figured out this algorithm, I knew I was onto something big. It massively simplified the design of the system by avoiding all the aforementioned problems, along with making things way more performant. (Amusingly, the day I figured out the algorithm I had a date with a girl I'd met recently. But I was so excited by what I'd just discovered that I was completely distracted the whole time. Needless to say, I did not do well with the girl.)

Building first version

Over the next 5 months, I built the first version of Storm. From the beginning I knew I wanted to open source it, so I made some key decisions in the early days with that in mind. First off, I made all of Storm's APIs in Java, but implemented Storm in Clojure. By keeping Storm's APIs 100% Java, Storm was ensured to have a very large amount of potential users. By doing the implementation in Clojure, I was able to be a lot more productive and get the project working sooner.

I also planned from the beginning to make Storm usable from non-JVM languages. Topologies are defined as Thrift data structures, and topologies are submitted using a Thrift API. Additionally, I designed a protocol so that spouts and bolts could be implemented in any language. Making Storm accessible from other languages makes the project accessible by more people. It makes it much easier for people to migrate to Storm, as they don't necessarily have to rewrite their existing realtime processing in Java. Instead they can port their existing code to run on Storm's multi-language API.

I was a long time Hadoop user and used my knowledge of Hadoop's design to make Storm's design better. For example, one of the most aggravating issues I dealt with from Hadoop was that in certain cases Hadoop workers would not shut down and the processes would just sit there doing nothing. Eventually these "zombie processes" would accumulate, soaking up resources and making the cluster inoperable. The core issue was that Hadoop put the burden of worker shutdown on the worker itself, and for a variety of reasons workers would sometimes fail to shut themselves down. So in Storm's design, I put the burden of worker shutdown on the same Storm daemon that started the worker in the first place. This turned out to be a lot more robust and Storm has never had issues with zombie processes.

Another problem I faced with Hadoop was if the JobTracker died for any reason, any running jobs would terminate. This was a real hair-puller when you had jobs that had been running for many days. With Storm, it was even more unacceptable to have a single point of failure like that since topologies are meant to run forever. So I designed Storm to be "process fault-tolerant": if a Storm daemon is killed and restarted it has absolutely no effect on running topologies. It makes the engineering more challenging, since you have to consider the effect of the process being kill -9'd and restarted at any point in the program, but it makes things far more robust.

A key decision I made early on in development was to assign one of our interns, Jason Jackson, to develop an automated deploy for Storm on AWS. This massively accelerated Storm's development, as it made it easy for me to test clusters of all different sizes and configurations. I really cannot emphasize enough how important this tool was, as it enabled me to iterate much, much faster.

Acquisition by Twitter

In May of 2011, BackType got into acquisition talks with Twitter. The acquisition made a lot of sense for us for a variety of reasons. Additionally, it was attractive to me because I realized I could do so much more with Storm by releasing it from within Twitter than from within BackType. Being able to make use of the Twitter brand was very compelling.

During acquisition talks I announced Storm to the world by writing a post on BackType's tech blog. The purpose of the post was actually just to raise our valuation in the negotiations with Twitter. And it worked: Twitter became extremely interested in the technology, and when they did their tech due-diligence on us, the entire due-diligence turned into a big demo of Storm.

The post had some surprising other effects. In the post I casually referred to Storm as "the Hadoop of realtime", and this phrase really caught on. To this day people still use it, and it even gets butchered into "realtime Hadoop" by many people. This accidental branding was really powerful and helped with adoption later on.

Open-sourcing Storm

We officially joined Twitter in July of 2011, and I immediately began planning Storm's release.

There are two ways you can go about releasing open source software. The first is to "go big", build a lot of hype for the project, and then get as much exposure as possible on release. This approach can be risky though, since if the quality isn't there or you mess up the messaging, you will alienate a huge number of people to the project on day one. That could kill any chance the project had to be successful.

The second approach is to quietly release the code and let the software slowly gain adoption. This avoids the risks of the first approach, but it has its own risk of people viewing the project as insignificant and ignoring it.

I decided to go with the first approach. I knew I had a very high quality, very useful piece of software, and through my experience with my first open source project Cascalog, I was confident I could get the messaging right.

Initially I planned to release Storm with a blog post, but then I came up with the idea of releasing Storm at a conference. By releasing at a conference:

  1. The conference would help with marketing and promotion.
  2. I would be presenting to a concentrated group of potential early adopters, who would then blog/tweet/email about it all at once, massively increasing exposure.
  3. I could hype my conference session, building anticipation for the project and ensuring that on the day of release, there would be a lot of eyes on the project.

So releasing at a conference seemed superior in all respects. Coincidentally, I was scheduled to present at Strange Loop that September on a completely different topic. Since that was when I wanted to release Storm, I emailed Alex, the Strange Loop organizer, and changed my session to be the release of Storm. As you can see from the session description, I made sure to use the Twitter brand in describing Storm.

Next, I began the process of hyping Storm. In August of 2011, a little over a month before the conference, I wrote a post on Twitter's tech blog announcing that Storm would be released at Strange Loop. In that post I built excitement for Storm by showing a lot the details of how Storm works and by giving code examples that demonstrated Storm's elegance. The post had the effect I wanted and got people really excited.

The next day I did something which I thought was really clever. I started the mailing list for Storm:

Here's why I think that was clever. A key issue you have to deal with to get adoption for a project is building social proof. Social proof exists in many forms: documented real-world usage of the project, Github watchers, mailing list activity, mailing list subscribers, Twitter followers, blog posts about the project, etc. If I had started the mailing list the day I released the project, then when people looked at it, it would have shown zero activity and very few subscribers. Potentially the project would be popular immediately and the mailing list would build social proof, but I had no guarantee of that.

By starting the mailing list before release, I was in a situation of arbitrage. If people asked questions and subscribed, then I was building social proof. If nothing happened, it didn't matter because the project wasn't released yet.

A mistake I made in those early days, which is bizarre since I was working at Twitter, was not starting a Twitter account for the project. Twitter's a great way to keep people up to date about a project as well as constantly expose people to the project (through retweets). I didn't realize I should make a Twitter account until well after release, but fortunately it didn't turn out to be that big of a deal. If I could do it again I would have started the Twitter account the same day I made the mailing list.

Between the time I wrote the post on Twitter's tech blog and the start of Strange Loop, I spent the majority of my time writing documentation for Storm. This is the single most important thing I did for the project. I wrote about 12,000 words of carefully thought out documentation – tutorials, references, API docs, and so on. A lot of open source developers don't realize how crucial docs are: people cannot use your software if they don't understand it. Writing good documentation is painful and time-consuming, but absolutely essential.

The moment of truth came on September 19th, 2011. I had fun with the release. I started my talk by saying I had been debating whether to open source Storm at the beginning of my talk, starting things off with a bang, or the end of my talk, finishing on an exclamation point. I said I decided to get the best of both worlds by open sourcing Storm right in the middle of my talk. I told the audience the time of the exact middle of my talk, and told them to shout out at me if I hadn't open sourced it by that time. As soon as that moment came, the audience shouted at me and I released the project.

Everything went according to plan. The project got a huge amount of attention and got over 1000 Github watchers on the first day. The project went to #1 on Hacker News immediately. After my talk, I went online and answered questions on Hacker News, the mailing list, and Twitter.

The aftermath of release

Within four days Storm became the most watched Java, Scala, or Clojure project on Github. Within two weeks, spider.io announced they already had Storm in production. I thought that was incredible and a testament to the high quality of the project and docs at release.

As soon as Storm was released I started getting feedback from people using the project. In the first week I made three minor releases to address quality of life issues people were having. They were minor but I was focused on making sure everyone had the best experience possible. I also added a lot of additional logging into Storm in that first week so that when people ran into issues on the mailing list, they could provide me more information on what was going on.

I didn't anticipate how much time it would take to answer questions on the mailing list. The mailing list had a ton of activity and I was spending one to two hours a day answering questions. Part of what made the mailing list so time-consuming is how bad most people are at asking questions. It's very common to get a question like this: "I'm having a lot of tuple failures. Why??" Most of the time there's an easy fix as typically the user had something strange going on with how they configured or were using Storm. But I would have to spend a ton of time asking the user follow-up questions to get them to provide me the information they already had so I could help them. You'd be amazed at how often a user fails to tell you about something really bizarre they did, like running multiple versions of Storm at once, manually editing the files Storm daemons keep on local disk, running their own modified version of Storm, or using a shared network drive for the state of Storm daemons. These endless hours I spent on the mailing list became very draining (especially since at the same time I was building a brand new team within Twitter) and I wouldn't get relief for well over a year.

Over the next year I did a ton of talks on Storm at conferences, meetups, and companies. I believe I did over 25 Storm talks. It got to a point where I could present Storm with my eyes closed. All this speaking got Storm more and more exposure.

The marketing paid off and Storm acquired production users very quickly. I did a survey in January of 2012 and found out Storm had 10 production users, another 15 planning to have it in production soon, and another 30 companies experimenting with the technology. To have that many production users for a major piece of infrastructure in only 3 months since release was very significant.

I set up a "Powered By" page for Storm to get that last piece of critical social proof going. Rather than just have a list of companies, I requested that everyone who listed themselves on that page include a short paragraph about how they're using it. This allows people reading that page to get a sense of the variety of use cases and scales that Storm can be used for. I included a link to my email on that page for people who wanted to be listed on it. As I did the tech talk circuit, that page continued to grow and grow.

Filling the "Powered By" page for a project can be frustrating, as there can be a lot of people using your project that you're not aware of. I remember one time I got an email from one of the biggest Chinese companies in the world asking to be listed on Storm's Powered By page. They had been using Storm for over a year at that point, but that whole time I had no idea. To this day I don't know the best way to get people to tell you they're using your software. Besides the link to my email on Storm's Powered By page, the technique I've used is to occasionally solicit Powered By submissions via Twitter and the mailing list.

Technical evolution of Storm

Storm is a far more advanced project now than when it was released. On release it was still very much oriented towards the needs we had at BackType, as we had not yet learned the needs of larger companies for major infrastructure. Getting Storm in shape to be deployed widely within Twitter drove its development for the next 1.5 years after release.

The technical needs of a large company are different than a startup. Whereas at a startup a small team manages the entire stack, including operations and deployment, in a big company these functions are typically spread across multiple teams. One thing we learned immediately within Twitter is that people didn't want to run their own Storm clusters. They just wanted a Storm cluster they could use with someone else taking care of operations.

This implied that we needed to be able to have one large, shared cluster running many independent applications. We needed to ensure that applications could be given guarantees on how many resources they would get and make sure there was no possible way one application going haywire would affect other applications on the cluster. This is called "multi-tenancy".

We also ran into process issues. As we built out the shared cluster, we noticed that pretty much everyone was configuring their topologies to use a huge number of resources – way more than they actually needed. This was making usage of the cluster very inefficient. The problem was that no one had an incentive to optimize their topologies. People just wanted to run their stuff and have it work, so from their perspective there was no reason not to request a ton of resources.

I solved both these issues by developing something called the "isolation scheduler". It was an incredibly simple solution that provided for multi-tenancy, created incentives for people to use resources efficiently, and allowed a single cluster to share both production and development workloads.

As more and more people used Storm within Twitter, we also discovered that people needed to monitor their topologies with metrics beyond what Storm captures by default. That led us to developing Storm's excellent metrics API to allow users to collect completely custom, arbitrary metrics, and send those metrics to any monitoring system.

Another big technical jump for Storm was developing Trident, a micro-batching API on top of Storm that provides exactly-once processing semantics. This enabled Storm to be applied to a lot of new use cases.

Besides all these major improvements, there were of course tons of quality of life improvements and performance enhancements along the way. All the work we were doing allowed us to do many releases of Storm – we averaged more than one release a month that first year. Doing frequent releases is incredibly important to growing a project in the early days, as each release gives you a boost of visibility from people tweeting/talking about it. It also shows people that the project is continuously improving and that if they run into issues, the project will be responsive to them.

Building developer community

The hardest part about building an open source project is building the community of developers contributing to the project. This is definitely something I struggled with.

For the first 1.5 years after release, I drove all development of Storm. All changes to the project went through me. There were pros and cons to having all development centered on me.

By controlling every detail of the project, I could ensure the project remained at a very high quality. Since I knew the project from top to bottom, I could anticipate all the ways any given change would affect the project as a whole. And since I had a vision of where the project should go, I could prevent any changes from going in that conflicted with that vision (or modify them to be consistent). I could ensure the project always had a consistent design and experience.

Unfortunately, "visionary-driven development" has a major drawback in that it makes it very hard to build an active and enthusiastic development community. First off, there's very little room for anyone to come in and make major contributions, since I am controlling everything. Second, I am a major bottleneck in all development. It became very, very hard to keep up with pull requests coming in (remember, I was also building a brand new infrastructure team within Twitter at the same time). So people would get discouraged from contributing to the project due to the incredibly slow feedback/merge cycle.

Another drawback to centering development on myself was that people viewed me as a single point of failure for the project. People brought up concerns to me of what would happen if I got hit by a bus. This concern actually limited the project less than you would think, as Storm was adopted by tons of major companies while I was at the center of development, including Yahoo!, Groupon, The Weather Channel, WebMD, Cerner, Alibaba, Baidu, Taobao, and many other companies.

Finally, the worst aspect to centering development on myself was the burden I personally felt. It's a ton of pressure and makes it hard to take a break. However, I was hesitant to expand control over project development to others because I was worried about project quality suffering. There was no way anyone else would have the deep understanding I did of the entire code base, and inevitably that would lead to changes going in with unintended consequences. However, I began to realize that this is something you have to accept when expanding a developer community. And later on I would realize this isn't as big of a deal as I thought.

Leaving Twitter

When I left Twitter in March of 2013 to pursue my current startup, I was still at the center of Storm development. After a few months it became a priority to remove myself as a bottleneck to the project. I felt that Storm would be better served with a consensus-driven development model.

I think "visionary-driven development" is best when the solution space for the project hasn't been fully explored yet. So for Storm, having me controlling all decisions as we built out multi-tenancy, custom metrics, Trident, and the major performance refactorings was a good thing. Major design issues can only be resolved well by someone with a deep understanding of the entire project.

By the time I left Twitter, we had largely figured out what the solution space for Storm looked like. That's not to say there wasn't lots of innovation still possible – Storm has had a lot of improvements since then – but those innovations weren't necessarily surprising. A lot of the work since I left Twitter has been transitioning Storm from ZeroMQ to Netty, implementing security/authentication, improving performance/scalability, improving topology visualizations, and so on. These are all awesome improvements but all of which were already anticipated as directions for improvements back in March of 2013. To put it another way, I think "visionary-driven development" is necessary when the solution space still has a lot of uncertainty in it. When the solution space is relatively well understood, the value of "visionary-driven development" diminishes dramatically. Then having that one person as a bottleneck seriously inhibits growth of the project.

About four months before leaving Twitter, Andy Feng over at Yahoo! started pushing me hard to submit Storm to Apache. At that point I had just started thinking about how to ensure the future of Storm, and Apache seemed like an interesting idea. I met with Doug Cutting, the creator of Hadoop, to get his thoughts on Apache and potentially moving Storm to Apache. Doug gave me an overview of how Apache works and was very candid about the pros and cons. He told me that the incubator could be chaotic and would most likely be painful to get through (though in reality, it turned out to be an incredibly smooth process). Doug's advice was invaluable and he really helped me understand how a consensus-driven development model works.

In consensus-driven development, at least how its done by many Apache projects, changes are voted into a project by a group of "committers". Typically all changes require at least two +1 votes and no -1 votes. That means every committer has veto power. In a consensus-driven project, not every committer will have a full understanding of the codebase. Many committers will specialize in different portions of the codebase. Over time, some of those committers will learn a greater portion of the codebase and achieve a greater understanding of how everything fits together.

When Storm first transitioned to a consensus-driven model, most of the committers had relatively limited understandings of the codebase as a whole and instead had various specialized understandings of certain areas. This was entirely due to the fact I had been so dominant in development – no one had ever been given the responsibility where they would need to learn more to make good decisions. By giving other people more authority and stepping back a bit, my hope was that others would fill that void. And that's exactly what happened.

One of my fears when moving to consensus-driven development was that the quality of changes would drop. And indeed, some of the changes that went in as we transitioned had some bugs in them. But this isn't a big deal. Because you'll get a bug report, and you can fix the problem for the next release. And if the problem is really bad, you can cut an emergency release for people to use. When I was personally making all development decisions, I would thoroughly test things myself and make use of my knowledge of the entire codebase such that anything that went out in a release was extremely high quality. But even then, my code sometimes had bugs in it and we would have to fix them in the next release. So consensus-driven development is really no different, except that changes may require a bit more iteration to iron out the issues. No software is perfect – what's important is that you have an active and responsive development community that will iterate and fix problems that arise.

Submitting to Apache

Getting back to the history of Storm, a few months after leaving Twitter I decided I wanted Storm to move to a consensus-driven development model. As I was very focused on my new startup, I also wanted Storm to have a long-term home that would give users the confidence that Storm would be a thriving project for years to come. When I considered all the options, submitting Storm to Apache seemed like far and away the best choice. Apache would give Storm a powerful brand, a strong legal foundation, and exactly the consensus-driven model that I wanted for the project.

Using what I learned from Doug Cutting, I eased the transition into Apache by identifying any legal issues beforehand that would cause problems during incubation. Storm made use of the ZeroMQ library for inter-process communication, but unfortunately the licensing of ZeroMQ was incompatible with Apache Foundation policy. A few developers at Yahoo! stepped up and created a replacement based on Netty (they all later became Storm committers).

In forming the intitial committer list for Storm, I chose developers from a variety of companies who had made relatively significant contributions to the project. One person who I'm super glad to have invited as a committer was Taylor Goetz, who worked at Health Market Science at the time. I was on the fence about inviting him since he hadn't contributed much code at that point. However, he was very active in the community and mailing list so I decided to take a chance on him. Once becoming a committer, Taylor took a huge amount of initiative, relieving me of many of the management burdens of the project. During incubation he handled most of the nitty-gritty stuff (like taking care of certain legal things, figuring out how to move the website over to Apache, how to get permissions for new committers, managing releases, calling for votes, etc.). Taylor later went to Hortonworks to work on Storm full-time, and he did such a good job helping shepherd Storm through the incubator that he is now the PMC chair for the project.

In September of 2013, with the help of Andy Feng at Yahoo!, I officially proposed Storm for incubation in Apache. Since we were well-prepared the proposal went through with only some minor modifications needed.

Apache incubation

During incubation we had to demonstrate that we could make releases, grow the user community, and expand the set of committers to the project. We never ran into any problems accomplishing any of these things. Once Storm was in incubation and I was no longer a bottleneck, development accelerated rapidly. People submitting patches got feedback faster and were encouraged to contribute more. We identified people who were making significant contributions and invited them to be committers.

Since incubation I've been just one committer like any other committer, with a vote no stronger than anyone else. I've focused my energies on any issues that affect anything core in Storm or have some sort of difficult design decision to work out. This has been a much more efficient use of my time and a huge relief compared to having to review every little change.

Storm officially graduated to a top-level project on September 17th, 2014, just short of three years after being open-sourced.

Conclusion

Building Storm and getting it to where it is now was quite a ride. I learned that building a successful project requires a lot more than just producing good code that solves an important problem. Documentation, marketing, and community development are just as important. Especially in the early days, you have to be creative and think of clever ways to get the project established. Examples of how I did that were making use of the Twitter brand, starting the mailing list a few months before release, and doing a big hyped up release to maximize exposure. Additionally, there's a lot of tedious, time-consuming work involved in building a successful project, such as writing docs, answering the never-ending questions on the mailing list, and giving talks.

One of the most amazing things for me has been seeing the huge range of industries Storm has affected. On the Powered By page there are applications listed in the areas of healthcare, weather, analytics, news, auctions, advertising, travel, alerting, finance, and many more. Reading that page makes the insane amount of work I've put into Storm feel worth it.

In telling this story, I have not been able to include every detail (three years is a long time, after all). So I want to finish by listing many of the people who have been important to Storm getting to the point it is at today. I am very grateful to all of these people: Chris Aniszczyk, Ashley Brown, Doug Cutting, Derek Dagit, Ted Dunning, Robert Evans, Andy Feng, Taylor Goetz, Christopher Golda, Edmund Jackson, Jason Jackson, Jeff Kaditz, Jennifer Lee, Michael Montano, Michael Noll, Adrian Petrescu, Adam Schuck, James Xu, and anyone who's ever contributed a patch to the project, deployed it in production, written about it, or given a presentation on it.

You should follow me on Twitter here.

Monday
Jun232014

The Entrepreneur Who Captivated Me

On January 12th, 2012 I went to Airbnb to give a talk about Storm at their tech talk series. It was a public event and quite a few people came. As usual at these events, there was mingling/socializing after the talk. A bunch of people approached me and one of these people was Jake Klamka.

There's nothing striking about Jake Klamka when you first meet him. He's a pretty unassuming fellow. So when he approached me after my talk, little did I know I was about to meet an astonishingly good entrepreneur – someone who's a master at getting help from others. Over the next few years, little by little, he would get me to help him more and more. And I would be amazed at what an awesome company he would build.

People approach me all the time with various requests. And the vast majority of them go about it the wrong way. For example, I often get emails – from complete strangers that I don't know – asking to meet with me for an hour or have a conference call with me. The reason they want to meet is typically vague – "I want to find out more about Storm" or "I'd like to get your advice on dealing with Big Data". In the past I have taken some of these meetings, and they are almost always a waste of time for both of us. They typically come to me with generic questions that could easily have been answered by doing a Google search, reading my blog, or watching one of my talks. I want to help people, but coming to me with questions that you could easily have answered yourself with a slight amount of research is not productive.

Jake puts a lot more thought into how he asks for help. A lot of what I learned about getting help from others I learned from how Jake slowly reeled me in to helping him. When he approached me at the Airbnb meetup, he didn't start with what he was working on and what he wanted from me. On the contrary, he asked about me and what I was up to. He told me he'd been following my work for awhile and asked me how I liked working at Twitter post-acquisition (I was part of BackType which was acquired by Twitter 6 months before). He asked me about how I came up with Storm. He did not attempt to shift the conversation to himself at all. A clever fellow, this Jake.

Naturally, I felt compelled to know more about this person who was flattering me by showing so much interest in my life. So I asked him what he was up to and what he was working on. He smoothly described to me how he had gone through Y Combinator but was now working on something new. He told me about his idea to form a fellowship program (which he later called Insight Data Science) to train PhD's with physics, math, and other backgrounds the skills necessary to become data scientists. The idea was that the PhD's were a surplus of very smart people without the greatest job prospects, and at the same time there was huge unsatisfied demand for data scientists in the industry. The PhD's already had the statistical skills necessary for data science – they just needed to learn the tooling of the industry. It was a smooth pitch and I told him I thought it was a pretty interesting idea.

In retrospect, I believe that he came prepared to pitch me on this idea, to gauge my interest, and to see if he could get me involved in helping out with the program. But here's the key: he didn't force himself on me. Rather than approach me and pitch me, he took advantage of natural social dynamics and waited until I asked about him. Not until then did he give his pitch. And not until I told him I liked his idea did he make his ask.

Jake asked if he could come by Twitter for lunch one day and go over his plans for Insight. It was a minimal commitment and one which he made as convenient as possible for me (by coming to where I work). Unlike the requests for 1 hour meetings I get from completely random people, this was appealing because it was a request from someone who had demonstrated he was doing something interesting and who was asking for specific feedback. So I was glad to take the meeting.

Jake came by Twitter for lunch and was super prepared. He brought a slide deck on his iPad and very clearly took me through the specifics of how Insight would work, how he was going to recruit students, and how he would connect them to industry. By making everything so tangible he made it incredibly easy for me to give him feedback. I told him it was important that he get companies involved in the program from the start because of how crucial it was to keep the students connected with the needs of industry. It was an extremely productive meeting and I felt like I provided a lot of value. I was impressed with Jake and outright told him I'd be happy to help him more in the future.

Over the next few months I helped Jake out with a few things – mostly related to connecting him to people within Twitter who might want to hire out of his program. That summer he launched the first batch of the program and asked if I would come down to Palo Alto to give a talk to the students. I was happy to do so.

That went well and the first batch of the program went on to be a huge success. Insight achieved – and continues to achieve – a 100% placement rate for all its fellows. Fellows get hired at all the top companies like Netflix, Twitter, Facebook, LinkedIn, Square, Microsoft, etc. It's a really impressive program. The most mind boggling part about it is that it's completely free for the fellows.

Jake and I occasionally kept in touch over the next year and a half. Then this February, he emailed me to update me on the program (it had grown a lot) and ask if he could meet with me again to talk about his idea for a new data engineering program. Unlike data science, this program would be geared towards people who want to build the pipelines underlying data products and the infrastructure that data scientists use. Like usual, Jake made the meeting as convenient as possible for me by coming to meet me near where I live in San Francisco.

I was skeptical about the program at first. While the data science program took advantage of a distinct imbalance – a surplus of smart PhD's and a deficit of data scientists – I wasn't entirely sure who were the target candidates for data engineering. However, as we talked I realized what a good idea this was. Data engineers are in huge demand, and getting skilled at data engineering is a ticket to a job that's not only high paying, but also supremely interesting – imagine playing with the datasets of Twitter, Netflix, Spotify, or Khan Academy. Everyone seems to have a "Big Data" problem now, and data engineers are crucial to solving those problems.

Besides that, it also became apparent that Insight is valuable in and of itself, not just as a ticket to a job in industry. Jake has created a cradle of creativity – the entire program is structured around fellows doing self-directed projects and helping each other learn and grow. The data engineering program seemed like a fantastic opportunity for any programmer to sharpen their skills and get an awesome job, whether a new grad or someone experienced looking to transition their career.

In our meeting Jake's biggest question mark was what kinds of people would make good fellows for the data engineering program. I felt strongly that you won't turn a non-programmer or an inexperienced programmer into a good programmer in 6 weeks, so the candidates he should go after should already be good programmers. The program should be focused on teaching fellows the tools and techniques of data engineering, not on teaching them basic engineering skills. This was a big difference from the data science program, and Jake and I spent a long time talking about how the interview process for Insight Data Engineering should work.

A few weeks later, Jake asked if I would become an official advisor to Insight. I accepted instantly. I found Insight to be hugely impressive, and Jake had long ago proven to me how resourceful and effective he is. Most importantly, it's always a joy to work with Jake because he makes sure that every meeting is productive. He asks for specific feedback and never asks overly vague questions. As an advisor to the program, I'm helping fellows with their projects and holding sessions to teach fellows various aspects of data engineering (like Lambda Architecture and Storm). I'm excited to see the program improve and evolve as it takes on more batches of students in the future.

It's fun to look back at how Jake reeled me in to helping out with Insight. He didn't just come up to me with a list of things he wanted – instead he first built my interest in him and what he was doing. Then he gradually escalated the commitments he asked for – first a lunch meeting where I work, then some introductions, then a visit to the Insight office, then help with evaluating candidates, then a regular commitment to spend time with the fellows. At each stage he made sure I was sufficiently interested such that taking on the commitment would be something I'm excited about – and not just a favor. That's what makes Jake such a great entrepreneur – he makes people who were total strangers want to go out of their way to help him. I think everyone can learn from Jake's example, as I certainly have.

Saturday
May102014

Why we in tech must support Lawrence Lessig

I'm an entrepreneur and a programmer. I've been fortunate to work in an industry that has seen incredible growth the past 10 years. It's amazing that an entrepreneur can launch services that reach millions of people with very, very small amounts of capital. Startups can compete with established services on a level playing field because the internet does not discriminate between different services. The internet is neutral. This has enabled an explosion of services that has provided massive amounts of value to the entire world.

I'm not here to convince you of the importance of net neutrality. This has been done thoroughly here, here, here, and here. Instead I want to talk about a much deeper issue.

Losing net neutrality would be extremely harmful to our society and our economy, and it's not hard to see this. And yet, the government seems to have a lot of trouble understanding this. The government could fix this problem instantly by reclassifying the internet from an "information service" to a "telecommunications service". Why don't they?

Whenever I'm fixing a bug in a software system, it's important I get to the root cause of the bug. If I only fix one particular manifestation of the bug, the bug will just pop up again in a different form. Something similar is going on with our government. I believe that Lawrence Lessig has thoroughly and articulately elucidated the root cause of our government's seeming stupidity, and that anything less than fixing that root cause is a losing battle. But I'll get back to that.

Government "stupidity"

It's not just net neutrality. Our government has made what appears to be stupid decisions on many, many issues. Consider copyright extension. The government consistently and frequently extends copyrights on existing works, preventing those works from entering the public domain. It is wrong in every single case to extend a copyright, as the only reason for having copyrights is to incentivize the creation of works in the first place.

Now consider your taxes. Have you ever wondered why you have to fill out a tax return? For most people, the government already has all the information needed to automatically fill out your taxes. A bill was proposed in 2011 that would have made this a reality – to give you a pre-filled tax return that you could then adjust on your own if there were any problems. It would have been cheap to implement, voluntary to use, and would have saved billions of dollars in tax preparation costs and millions of hours in time. Pre-filled tax returns have been shown to work in other countries like Denmark, Sweden, and Spain. Yet, the bill died.

These are just simple and easy to understand examples, There are endless examples of terrible government decisions in climate change, healthcare, the lead up and response to the 2008 financial crisis, agricultural subsidies, and pretty much any area you can think of.

It's easy and convenient to conclude that these decisions are the result of members of our government being stupid. If that's the conclusion you've made, I don't blame you. It's a conclusion I've made in the past, and it made me completely apathetic to government. It made me believe that's just the way government is, and we have to work around that to live our lives.

The root cause

But let's consider an alternative explanation. It's not easy to become a representative or a senator. Getting elected is a brutally competitive game – you don't win by being stupid. You get there by being smart and cunning.

I believe that the decisions made by our elected officials are calculated and smart – from their perspective. They want you to think they're dumb, that that's just the way government works. The more apathy in the citizenry and the more people believe that's just the way things are, the less likely people will challenge the status quo.

That brings us back to Lawrence Lessig. I don't think anyone has done a better job at explaining what is going on with our government than he has.

His explanation is very simple. In order to get elected in America, you need a lot of money to run an election campaign. The vast majority of that campaign money comes from a tiny fraction of Americans - only 0.05% of people. If you don't get the support of those people, you won't have enough money to realistically run for office. So that tiny fraction of funders selects the people we vote for in elections. And naturally, the funders only select people who favor their interests.

So you end up with a government that represents the interests of the funders and not the people as a whole. You end up with a government full of people who are good at catering to the interests of people who can give them the money to get elected/re-elected. The system selects for those kinds of people.

This is how companies like Comcast and AT&T influence policies enacted by Congress. They spend tons of money lobbying to get laws passed to limit competition and preserve their monopoly status. They lobbied extremely heavily to get Congress to put pressure on the FCC to prevent them from classifying the Internet as a telecommunications service, which would have ensured net neutrality. This article and this video explain the connection between lobbying and net neutrality very well.

There is only one solution to this problem – funding for elections must be spread out so that "the funders" equals "the people". Lessig has many proposals for how to do this. One idea is for everyone to get a small voucher (say $50) to donate to candidates of their choice. That $50 comes out of the first $50 in taxes you paid that year. This would create many billions of campaign dollars every year, more than enough to spread the influence of money across the whole population.

Lessig has explained the influence of money in politics much better than I ever could, so check out his excellent TED talk and his extremely well-written book.

Our apathy is misguided

When I talk to people about this problem, almost no one disagrees. There's almost universal agreement that government is beholden to special interests, that the way campaigns are financed is hugely corrupting. And yet almost everyone is apathetic and believes it's impossible to change the system. I'm reminded of George Orwell's "Animal Farm". After enough time, a way of life becomes the way of life. You forget that things can work differently, that things have worked differently in the past.

I don't want to live in a world where the government is constantly trying to trade off the public good for the benefit of a special few. Instead I want to live in a world where the government represents and works for the people. I know, that sounds naive. But if you don't strive for it, you most assuredly won't achieve it. Much, much crazier things have happened.

This issue is so important and touches so many aspects of our society that I believe it's our duty as citizens to fight for change any way we can. We have to support people who are working day and night on this, who have excellent ideas on how to achieve reform. Lawrence Lessig is certainly one of those people.

Supporting Lawrence Lessig

Lawrence Lessig's latest initiative is called the MayDay PAC, the "SuperPAC to end all SuperPACs". The MayDay PAC is raising money to get people elected who are committed to eliminating the unbalanced influence of special interests. Once you get past the irony of it, it's actually a brilliant idea. Check it out and read the FAQ.

So yes, at the end of all this I'm asking you to make a political donation – $10, $100, or whatever you can afford. I'm not involved in politics in any way and am not affiliated with Lessig or the MayDay PAC. I'm just a concerned citizen. At worst, you'll be losing a couple bucks. At best, you'll be helping enable one of the most important reforms, if not the most important reform, in a generation.

If you're still not convinced, here are links to the MayDay PAC and many of the excellent talks and writings produced by Lessig:

Monday
Feb242014

The inexplicable rise of open floor plans in tech companies

Update: I originally quoted the average price of office space as $36 / square foot / month, where in reality it's per year. So I was accidentally weakening my own argument! The post has been updated to reflect the right number.

The "open floor plan" has really taken over tech companies in San Francisco. Offices are organized as huge open spaces with row after row of tables. Employees sit next to each other and each have their own piece of desk space. Now, I don't want to comment on the effectiveness of open floor plans for fields other than my own. But for software development, this is the single best way to sabotage the productivity of your entire engineering team.

The problem

Programming is a very brain-intensive task. You have to hold all sorts of disparate information in your head at once and synthesize it into extremely precise code. It requires intense amounts of focus. Distractions and interruptions are death to the productivity of a programmer. And an open-floor plan encourages distractions like nothing else.

First of all, you're in a room with dozens and dozens of other people. That's naturally going to be very noisy. People are talking all over the room. The person next to you is chewing on potato chips. You constantly hear people getting up and walking around. Hopefully you at least have carpeted floors, or else it's going to be REALLY loud. All day long doors to conference rooms are opening and closing. Noise breaks concentration. Broken concentration breaks productivity. If you're lucky you're one of those people who can work by drowning out the noise with music through your headphones. Otherwise, you're out of luck.

Even worse than the noise is the fact that you are very easy to interrupt in an open floor plan. People tap you on the shoulder to ask you questions. Now maybe you're different than me, but I find it pretty hard to focus when someone starts talking to me as I'm working. It's frequent enough in an open floor plan that even just the potential of that happening hurts my concentration.

There's evidence that open plan offices make it more likely for people to get sick. This is not really that surprising as with a big open space you'd expect germs to spread more easily. Besides that, the lack of privacy also bothers a lot of people.

I can't tell you how many times I've heard this comment from programmers: "I get most of my work done once most people have left the office and I can concentrate". This translates to "I can't do work during normal business hours!" Think about that. This is truly absurd. You should be able to work during working hours.

The "collaboration" justification

The most common justification I hear about the open floor plan is that it "encourages collaboration". Now it's true, the open floor plan does create the occasional opportunity for collaboration. You might overhear someone the row over talking about how they need to do some particular load testing of a database, and then you jump in with how you built such a tool for such a purpose. Everyone says "Hurrah!" and something truly valuable occurred. But in reality, these moments of true serendipitity are few and far between.

The tradeoff for these moments is that all your working hours are now sabotaged by non-stop distractions that ruin your productivity. The primary task of a programmer is writing code, which involves sitting at a desk and thinking and typing. Code is not written in these supposed spontaneous acts of collaboration. A working environment should make your primary tasks as easy as possible, which for programming means encouraging focus and concentration.

Cost-effectiveness of open floor plans

Let's be honest though. Open floor plans are done because it's the most cost-effective way to squeeze as many people into one space as possible. Space is expensive so you should make as best use of it as you can. But does minimizing space REALLY minimize cost? Because programmers aren't call center workers. They're very expensive, so sabotaging their productity is thereby very expensive. Let's play with some numbers to see how this plays out.

I'm not sure on the exact numbers, but in San Francisco a programmer probably costs you on average $100K a year in salary. With benefits the total cost is in the neighborhood of $120K. So a programmer is a $10K / month investment.

The average price per square foot per month of an office in San Francisco is $36 / year, or $3 / month. But let's say the average rate is $10 a month, since more expensive rates favor open floor plans, and I want to drive the point home. If you actually look at a sample of rates for New York and San Francisco, you'll see that in reality almost no offices are nearly as expensive as $10 / month.

With an open floor plan, let's say a programmer takes up an average of 6ft x 6ft of space. So the cost of space per programmer per month is 10 * 6 * 6 = $360 / month. This means the cost per programmer including space is $10360 / month.

Let's say that in a non open-floor plan each programmer requires four times the amount of space as an open-floor plan environment – an average of 12ft x 12ft. That's actually quite a lot of space. In this case, the cost of space per programmer per month results in $1440 / month, making the cost of a programmer including space $11440 / month. This makes a non-open-floor-plan programmer 10.1% more expensive than an open-floor-plan programmer, or put another way an open-floor-plan programmer is 9.2% less expensive than a non-open-floor plan programmer.

So on a per programmer basis, if the open floor plan lowers productivity by less than 9.2%, it's worth it. But this seems overly optimistic. In my experience working in an open floor plan my productivity is cut by half or worse. Plus there are things I literally am unable to do in such an office because they require too much focus. So my own estimate of my productivity decrease in such an office could be closer to a 75% decrease!

This analysis doesn't even take into account that if your programmers are more productive, you need less programmers. You only need half the programmers if your programmers are twice as effective. This drives your space needs in half, vastly skewing the numbers further in favor of non-open-floor plans. Unless my estimates of productivity decrease or space needs are way, way off, the open floor plan is not even close to worth it.

Alternatives

I don't know what the "best" arrangement is for an office for programmers. Perhaps it's 1 person offices, 2 person offices, or 3-5 person offices. Perhaps offices combined with open "collaboration areas" is the right approach. But certainly the open floor plan as practiced today is not it.

It might be possible to establish a culture that enforces a library-like environment on an open floor plan. If talking too loudly got you "shushed", then it would certainly be a lot quieter and easier to concentrate. I don't know of any company that has created a culture like this, but it's certainly an interesting idea. Personally though I think that creating a good work environment through physical means will be much more robust than doing so through cultural means.

Of course, having an alternative environment that allows for focus and concentration does not mean that spontaneous collaboration goes away. Because, surprise! – your employees still interact with each other at lunch, in the kitchen, in meetings, and in all the other natural places that people socialize.

Conclusion

Another thing that people like about the open floor plan is that it "looks good" and has the "startup feel". Well, I don't know about you, but to me the startup feel is about shipping quickly and getting things done. I would greatly, greatly prefer an office environment that helped me do that rather than get in my way.

The open floor plan really only works when you're really small, when it's essentially equivalent to one of those "5 man offices". But once you start to get bigger, say the 15 person range, it starts becoming unwieldy. Once you're big enough and have the resources to customize the office environment, I think it's incredibly important to find an office environment that works better.

Here's an idea. Establish a culture that encourages employees to work from home as much as they want. They should really understand that face time is completely irrelevant. Then, measure how many people come in each day. Do polls to see how productive people find the office. If the numbers are low, then there's something wrong with your office environment. This puts the burden on you, as the employer, to make an office environment that people actually want to work in.

You should follow me on Twitter here.