substrate-etl Retroactive Funding: Q3+Q4 2022 - Q1+Q2 2023
substrate-etl is a large-scale index of Polkadot + Kusama and their parachains, based on Polkaholic.io, a multichain block explorer/indexer. Using this data warehouse, anyone can query decoded blocks, extrinsics, events, and calls across both ecosystems.
We onboarded Polkadot Data into Google Public Datasets, announced in September 2023:
- Enhancing Google Cloud’s blockchain data offering with 11 new chains in BigQuery
- Polkadot - Harness the power of Polkadot's BigQuery Public Dataset
The substrate-etl index results in two large datasets (crypto_polkadot and crypto_kusama) covering the 2 relay chains and over 70 parachains. This places Polkadot/Kusama alongside other Web3 ecosystems in Google Public Datasets and supports complex, large-scale SQL analyses using BigQuery's fast query performance on very large datasets. Having substrate-etl enables Dune-like analytics to guide business intelligence, inform engineering decisions, and showcase ecosystem growth.
Power users include Parity Data (led by Karim Jedda and Pavla Mijic), Subwallet/DotInsights, Web3metrics, Dolpha.com, ChainViz, and many data analysts within parachains and outside the ecosystem.
Quick Start
For a full guide, see substrate-etl, but you can get started with the following Analytics Hub links:
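As an illustrative sketch of what a query against these datasets looks like (the `blocks0` table and its columns are assumptions about the schema, not verified against the substrate-etl documentation), a daily block/extrinsic count on the Polkadot relay chain might be written as:

```sql
-- Hypothetical example: daily block and extrinsic counts for the Polkadot relay chain.
-- The crypto_polkadot dataset is named in this proposal; the table and column
-- names below are assumed for illustration.
SELECT
  DATE(block_time) AS day,
  COUNT(*) AS blocks,
  SUM(extrinsic_count) AS extrinsics
FROM `substrate-etl.crypto_polkadot.blocks0`
WHERE block_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day
ORDER BY day;
```

Queries of this shape can be run directly in the BigQuery console once the dataset is linked via Analytics Hub.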
Roadmap
In addition to continuing support for substrate-etl/Polkaholic.io for existing and future chains, our plans for 2024 are to:
- chart a course within the Decentralized Futures program for 2024, supporting Web3 Foundation's Data Analysis Tools for Substrate-based Blockchains (e.g. traces and staking), and position substrate-etl + Polkaholic APIs to support a Dune Analytics-style data service for Polkadot/Kusama with leading teams in the ecosystem
- develop + support a Polkadot Builder Collective with a "builders supporting builders" orientation
- build support for CoreJam+CorePlay+DA and extend support for ink!/WASM contracts
- extend substrate-etl to support raw + fully-decoded EVM datasets (transactions+logs), enabling Substrate+EVM hybrid chain analytics, as more parachains incorporate hybrid chain designs
Proposal Details
Requested: 99,025 DOT
About Colorful Notion
Colorful Notion is led by Sourabh Niyogi and Michael Chung, based in the SF Bay Area. We have been active in the Polkadot ecosystem over the last 2 years and look forward to building with others in CoreJam/Polkadot 2.0’s “map reduce” ubiquitous computing future.
We are grateful to everyone who has helped us and hope we can pay it forward by helping others.
Comments (7)
Proposal Passed: Aye (69), Nay (78)
Your web site seems rough, and why do we need you? We already have Subscan.
Thanks for your note. However, see this for an alternative viewpoint. We have the highest respect for Subscan and have focused on substrate-etl indexing + analytics instead of replicating their level of polish.
To address the roughness differently, we are looking to work with new partners like ChainIDE (after a successful collaboration with them on ink! WASM Contract Explorer this summer), and in general designing a strategy for dealing with the complexity of hybrid Substrate+EVM chains, CoreTime/CoreJam/CorePlay, and more of a Dune Analytics type future for Polkaholic.io which requires a data warehouse backend.
On the "why do we need X, we already have Y" point, I am glad the Polkadot founders didn't listen to the people who said "why do we need Polkadot, we already have Ethereum". They didn't, and our subsection of the world is better for it. It's always important for there to be people who push the boundaries to do new things. We see our last year as training us to do new things in the CoreXYZ future, and I can't convey enough how committed we are to doing something with more impact than building a block explorer with a little more polish and an indexer alone. We would like to build high-growth, high-scale solutions for millions of people here; we did not show up to build a web site.
Hi,
Thanks for the proposal, some questions below.
Overall, although I find the direction useful, your costs seem to exceed 600K USD per year (even more if you have received other funding from the treasury or the W3F), and I find this total very unconvincing for the product provided.
Thanks,
Kutsal | Helikon
@helikon Very awesome questions
(a) We have an open W3F application on Deep Account Analytics and have delivered Milestone 1 + Milestone 2 for $30K. The W3F grant process does not support running production applications with high compute/storage costs to serve the community, though this may change. (b) We have completed a ChainIDE+Polkaholic WASM Contract Explorer integration.
We worked on 1(a)+(b) in Q2+Q3 2023 and adjusted our cost model for Q2 2023 accordingly (and would adjust Q3 2023/Q4 2023 if we are funded through this Treasury proposal). I do think it's important that you ask this type of accountability question and check the math so there isn't any double counting by any team. I appreciate what you are doing; it's very important.
We did not receive any previous Treasury grants for this work. We did attempt one Polkaholic Treasury proposal in Kusama OpenGov1 (it got 8 AYEs and 0 NAYs, insufficient to pass; 4 council members were too busy), but after Decoded 2022 we decided to focus on the substrate-etl/analytics side and work with Parity Data on a Polkadot Data Alliance bounty at Parity Data's direction. That bounty did not materialize this Spring/Summer, and given the "decentralize Parity" effort it is on hold until decentralisation activities take concrete shape.
TJ+FN have no web presence, but since you ask: TJ is our systems engineer Tapas, who keeps this running and operates the Google Cloud setup with a sizeable GKE cluster of substrate archive nodes + crawlers. FN (Francesca, my wife) is our part-time admin who makes sure Google and our team get paid; in a literal sense she has been funding this and is our food+water+cloud. Some people are amazed we can do so much with a 4-person team, but we are hardly the only team like this.
Q3 involved SN+MC (a) working on the W3F grants referenced in (1)(a); (b) the inkubator program referenced in (1)(b); and (c) a parallel effort called evm-etl, in which we decode EVM chains like Ethereum, Arbitrum, Optimism, and Base (started in mid-May). This can also be applied to Moonbeam, Astar, etc., but that was of secondary importance given the relative DAUs. This month, we were asked by Pavla to model the small but growing number of hybrid chains in substrate-etl, which we are now ready to do given (c).

We got paid for (a) in Nov but have not been paid for (b), and (c) is far more of an EVM project than a Substrate project. We wished to have a basic 1.0 solution that works across Substrate+EVM ecosystems and to account for (a-c) payments before modeling Q3+Q4 2023 completely. At the end of Q2, we split our Google Cloud projects into substrate and non-substrate to model Colorful Notion's Polkadot and non-Polkadot activities, but it is a bit tricky with hybrid Substrate+EVM chains having 2 chain identities (1284 vs 2004, 592 vs 2006, etc.).

As for a forecast of yearly costs going forward: we made substrate-etl run smoothly so that it doesn't need both SN+MC working on it full time. There is maintenance (TJ+GC+.25 SN), and then there is innovation (SN+MC+X FTEs doing new things). We are looking to chart a course for innovation with W3F now, but the maintenance of that innovation may still reside here in the Treasury. I am not the architect of this; this is just what they (W3F+Parity) tell me to do...
Colorful Notion's relevant repos are Polkaholic and substrate-etl
substrate-etl documents the "decoded" datasets, but generally a SQL-knowledgeable data analyst has to spend a few minutes to an hour mapping those decoded datasets into SQL for any specific question. For high-interest "deep" activity (staking+democracy now, assetconversion soon, coretime+coreplay later), it is reasonable to build special-purpose pallet-centric datasets, or area-specific datasets like "dextrades", that take the decoded datasets to the next level rather than simply bundling them into defillama dashboards or block explorer charts alone. For your specific question on staking nominators, join the Polkadot Data Alliance Matrix chat room and/or email michael@colorfulnotion.com -- your question likely calls for a new dataset output in polkadot_analytics, and we are happy to figure out a specific plan (the pallet-specific data for Polkadot) and a general one (how multichain pallets + domains like "dextrade" should be analyzed across the ecosystem) together.
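To make the "map decoded datasets into SQL" step concrete, here is a hypothetical sketch of a pallet-centric query over the decoded events table (the `events0` table and its `section`/`method` columns are assumptions about the schema, not verified against the substrate-etl documentation):

```sql
-- Hypothetical sketch: counting staking "Rewarded" events per day on the
-- Polkadot relay chain from the decoded events dataset. Table and column
-- names are assumed for illustration.
SELECT
  DATE(block_time) AS day,
  COUNT(*) AS reward_events
FROM `substrate-etl.crypto_polkadot.events0`
WHERE section = 'staking'
  AND method = 'Rewarded'
GROUP BY day
ORDER BY day DESC
LIMIT 30;
```

A pallet-centric dataset like polkadot_analytics would pre-compute joins of this kind so that analysts do not have to rediscover the event shapes for each question.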
On the overall point that we are at 120K DOT/yr in aggregate across all of this (with GC being a significant component): that's accurate. I'm proud of the work we've done as a 4-person team and think we can do more, not just maintaining and extending this, but innovating. I hope the combination of Treasury+W3F can help us chart a course on the innovation possibilities in 2.0's CoreJam+CorePlay+CoreTime with this work as a foundation.