Applying the "Five Whys" to the Client Diversity problem
The most recent All Core Devs call, #90, was almost entirely spent discussing a single issue. I highly recommend listening to the call yourself. On the call, Alexey brought up the issue of burnout among client developers. While I think the conversation was an important beginning, I also think that it moved towards solutions too quickly, before the problem was adequately understood. It's important that we take the time to map out the problem space. One simple but effective framework for exploring a problem space is Five Whys. So without further ado, let us examine our first "why".
Why does the Geth team operate under stressful conditions that lead to burnout?
Looking at statistics on client market share from Etherscan, we see the following distribution:
- Geth: 75%
- Parity & OpenEthereum: 20%
- Nethermind: 1%
The important part here is that Geth's market share is greater than 51% of hashpower. Suppose that in the upcoming Berlin hard fork, Geth's implementation of an included EIP has a bug. Even if no other client implementation has the bug, the network would fork upon encountering a block that hits it. This block should be invalid, and would be considered invalid by all of the other clients, but with well over 51% of all mining nodes running Geth, the network as a whole would follow the wrong chain.
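To make the scenario concrete, here is a minimal Python sketch. It assumes that hashpower splits in the same proportions as the market share figures listed above and that the chain with more hashpower behind it becomes canonical. Both are simplifications, and `fork_outcome` is a hypothetical helper, not part of any client. The roughly 4% of the market not covered by the list is ignored.

```python
def fork_outcome(shares, buggy_clients):
    """Split hashpower between the chain that buggy clients accept and
    the chain that every other client follows.

    Assumption: a client's hashpower equals its market share, and the
    chain backed by more hashpower wins.
    """
    buggy = sum(s for name, s in shares.items() if name in buggy_clients)
    correct = sum(s for name, s in shares.items() if name not in buggy_clients)
    if buggy > correct:
        winner = "buggy"
    elif buggy < correct:
        winner = "correct"
    else:
        winner = "tie"
    return buggy, correct, winner


# Market share figures from the list above (percent).
shares = {"Geth": 75, "Parity & OpenEthereum": 20, "Nethermind": 1}

# A consensus bug that only Geth has: the invalid chain wins.
print(fork_outcome(shares, {"Geth"}))

# The same bug in a minority client is simply outvoted.
print(fork_outcome(shares, {"Nethermind"}))
```

Under these assumptions, a Geth-only bug puts 75 units of hashpower on the invalid chain against 21 on the valid one, while the same bug in Nethermind alone leaves 95 units on the valid chain.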
This dynamic places an extremely high requirement of correctness on the Geth client and team. So the answer to our first "why" is: Because the network lacks sufficient client diversity.
It is worth noting that client diversity won't suddenly make client development a stress-free job. This is still an area that is worth exploring on its own, to find ways to make client development more rewarding and less burdensome, but we must also acknowledge that we likely cannot meaningfully address the problem by simply focusing on the Geth team.
Why does the Ethereum network lack client diversity?
When the Ethereum mainnet launched, we had multiple clients. The most prominent ones were Geth and CPPEthereum. Parity joined the party a bit later, and CPPEthereum fell away. Since then, nobody other than Parity has managed to gain any significant market share. During the last year, Nethermind showed up and is looking promising; however, it is currently sitting at 1% market share. And recently, Parity's market share has declined significantly due to some turmoil and uncertainty about the project's future.
Our ideal situation is a network with 3 or more clients each holding a non-trivial percentage of the market, and with no client holding greater than 51%. The ideal situation is client plurality, but we've settled into a client super-majority.
So why don't we have more clients? I can personally speak from experience that building an Ethereum client is prohibitively difficult. Geth is full of complex optimizations which allow it to operate on the network. It has taken the Geth team years to build out this level of complexity, and they continue to optimize today.
Some might be quick to suggest that we find ways to provide support and assistance to the underdog clients. I'm wary of solutions that rely on the existence of the mythical person-month: throwing more engineers at a hard problem has rarely succeeded in software development, and I don't expect it to here. Instead, I think it's appropriate to focus on the complexity itself.
Why is it difficult to build an Ethereum client?
Now we are getting closer to the root of the problem. It turns out that a lot of the difficulty exists in the networking protocol, i.e. the set of tools that Ethereum clients use to connect to each other and share information about the blockchain. The networking rules of Ethereum (defined in devp2p) end up affecting and even dictating the design and requirements for an Ethereum client. Some of the networking tools dictate a sub-optimal architecture, or even require functionality that may not be necessary for the client to operate. Client developers need to work within those constraints.
Why does the networking protocol make client implementation difficult?
I believe the answer to this question falls roughly into two categories.
- State management
- Monolithic networking requirements
For state management, Ethereum clients are required both to sync the full state over the network and to maintain a local copy of the state. Both of these are difficult. Syncing the state involves making millions of requests and saturating disk I/O, for both the syncing client and the server reading and serving the state requests. The newly synced state then needs to be maintained and pruned to keep the database fast enough to execute new blocks. This is a serious engineering challenge! Our only networking tool to sync state, GetNodeData, is optimized for one specific state database format (tellingly called "the naive layout"). The 'flat' database layout popularized by Turbo Geth has major performance benefits for maintaining the state (around 44GB at the time of writing), but using that layout makes it more difficult to serve GetNodeData requests over the network.
When we shift our focus to the networking stack, specifically the DevP2P ETH protocol, we find other things that increase client complexity. In order to be a part of this network, a client needs to be capable of:
- Serving GetNodeData requests for arbitrary state access for recent blocks.
- Serving requests for the entire history of the chain data, including headers, block bodies, and receipts.
The underlying data necessary to serve these requests is not fundamentally necessary for many client operations, but currently it is compulsory to support these features. This requires all clients to build out a large amount of functionality that may not actually be necessary for the primary purpose the client aims to serve. For example, a client that is primarily acting as a gateway for sending transactions does not need the historical chain data, and likely only needs a small subset of the state, but in the present version of Ethereum, it must still keep a full copy.
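The tension between the two state database layouts described above can be illustrated with a toy model. The sketch below is a simplification under stated assumptions: it uses a binary Merkle tree and SHA-256 in place of Ethereum's hexary Merkle Patricia trie and Keccak-256, and `build_nodes`, `HashKeyedStore`, and `FlatStore` are hypothetical names, not code from Geth or Turbo Geth. It only aims to show why a hash-keyed store can answer a GetNodeData-style request with a single lookup, while a flat store has to reconstruct tree nodes first.

```python
import hashlib


def h(data: bytes) -> bytes:
    # Assumption: SHA-256 stands in for Keccak-256.
    return hashlib.sha256(data).digest()


def build_nodes(accounts):
    """Build a toy binary Merkle tree; return (root_hash, {hash: node})."""
    if not accounts:
        raise ValueError("toy tree needs at least one account")
    nodes = {}
    level = []
    for key, value in sorted(accounts.items()):
        node = b"leaf:" + key + b"=" + value
        nodes[h(node)] = node
        level.append(h(node))
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            # A branch node is the concatenation of its child hashes.
            node = b"branch:" + b"".join(level[i:i + 2])
            nodes[h(node)] = node
            parents.append(h(node))
        level = parents
    return level[0], nodes


class HashKeyedStore:
    """Toy 'naive layout': every tree node is stored under its hash."""

    def __init__(self, accounts):
        self.root, self.nodes = build_nodes(accounts)

    def get_node_data(self, node_hash):
        # Matches the request shape exactly: one key-value lookup.
        return self.nodes.get(node_hash)


class FlatStore:
    """Toy 'flat layout': account data is stored directly under its key."""

    def __init__(self, accounts):
        self.accounts = dict(accounts)

    def read_account(self, key):
        # Fast path for executing blocks: one key-value lookup.
        return self.accounts.get(key)

    def get_node_data(self, node_hash):
        # No hash index exists, so tree nodes are regenerated on demand.
        _, nodes = build_nodes(self.accounts)
        return nodes.get(node_hash)


accounts = {b"alice": b"10", b"bob": b"20", b"carol": b"30"}
naive, flat = HashKeyedStore(accounts), FlatStore(accounts)
# Both stores return the same node, but at very different cost.
assert flat.get_node_data(naive.root) == naive.get_node_data(naive.root)
```

In this toy model the flat store rebuilds the whole tree for every request. A real client would avoid that with extra indexes or caches, which is precisely the additional complexity the networking primitive forces onto it.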
It seems I only need 4 "whys" to get to the root cause. The Ethereum protocol has not aged well. At the time it was designed, most of the problems we see today were not understood or they simply were not problems yet because the state size was smaller and the chain history was shorter.
I have spent most of the last year focused on this problem. I continue to be amazed at how many problems within Ethereum can be traced back to the underlying networking layer. Perhaps the most prominent example is that disk I/O has historically been a bottleneck for clients. The bottleneck exists because clients tended to implement their state databases using the naive representation of the trie, a choice dictated by the GetNodeData networking primitive.
To fix this issue, we need to overhaul parts of the actual Ethereum consensus layer and the underlying networking layer. A significant chunk of the work has already begun under the "Stateless Ethereum" effort that Alexey and I have been leading for the last 8 months. Some of it is being at least mitigated by the Geth team itself, via the SNAP syncing protocol they have been working on for the last year. And some of it still needs talented people to focus on gaining a deeper understanding of the problem and then figuring out a viable solution. Deconstructing the "Monolith" DevP2P ETH protocol is still partially unsolved. We have a basic understanding of how the network can be split into three separate special-purpose networks, but currently nobody is working on it directly. And then there are ideas like re-genesis, which provide a mechanism to side-step some problems entirely. It's a radical approach that might gain us a lot of ground if it pans out.
The primary thing that you should take away from this is that the Ethereum network has a lot of difficult work that needs to be done, and there are a limited number of people with the skills necessary to do this work. More developers are onboarding every day, but it takes time and dedication to learn all of the things needed to make meaningful contributions.
And every EVM feature we add takes time away from client developers focusing on the lower level issues that are largely invisible to day-to-day users of the network. If we want the Ethereum network to succeed long term, I think we as a community need to come together around these issues, ensure that their underlying causes are getting the attention and discussion they need, and above all else focus on working together collaboratively towards meaningful technical solutions.