# @0xproject/pipeline

This repository contains scripts used for scraping data from the Ethereum blockchain into SQL tables for analysis by the 0x team.
## Contributing

We strongly recommend that the community help us make improvements and determine the future direction of the protocol. To report bugs within this package, please create an issue in this repository.

Please read our contribution guidelines before getting started.
Install dependencies:

```bash
yarn install
```

### Build

```bash
yarn build
```

### Clean

```bash
yarn clean
```

### Lint

```bash
yarn lint
```
### Migrations

- Create a new migration: `yarn migrate:create --name MigrationNameInCamelCase`
- Run migrations: `yarn migrate:run`
- Revert the most recent migration (CAUTION: may result in data loss!): `yarn migrate:revert`
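As a rough sketch of what a migration created by `yarn migrate:create` might look like: the table name `raw.example_events`, its columns, and the timestamp in the class name are all hypothetical, and the TypeORM interfaces are stubbed locally so the snippet runs standalone (a real migration imports them from `typeorm`).

```typescript
// Minimal local stand-ins for TypeORM's MigrationInterface and QueryRunner so
// this sketch is self-contained; real migrations import these from 'typeorm'.
interface QueryRunner {
    query(sql: string): Promise<void>;
}
interface MigrationInterface {
    up(queryRunner: QueryRunner): Promise<void>;
    down(queryRunner: QueryRunner): Promise<void>;
}

// A hypothetical migration. Table and column names are illustrative only.
export class CreateExampleEvents1546000000000 implements MigrationInterface {
    public async up(queryRunner: QueryRunner): Promise<void> {
        await queryRunner.query(
            `CREATE TABLE raw.example_events (
                block_number BIGINT NOT NULL,
                log_index INTEGER NOT NULL,
                PRIMARY KEY (block_number, log_index)
            )`,
        );
    }
    // down should undo exactly what up did, so that migrate:revert works.
    public async down(queryRunner: QueryRunner): Promise<void> {
        await queryRunner.query(`DROP TABLE raw.example_events`);
    }
}
```

Filling in both `up` and `down` keeps `yarn migrate:revert` usable as an escape hatch.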
## Testing

There are several test scripts in `package.json`. You can run all the tests with `yarn test:all` or run certain tests separately by following the instructions below. Some tests may not work out of the box on certain platforms or operating systems (see the "Database tests" section below).
### Unit tests

The unit tests can be run with `yarn test`. These tests don't depend on any services or databases and will run in any environment that can run Node.
### Database tests

Database integration tests can be run with `yarn test:db`. These tests will attempt to automatically spin up a Postgres database via Docker. If this doesn't work, you have two other options:

1. Set the `DOCKER_SOCKET` environment variable to a valid socket path to use for communicating with Docker.
2. Start Postgres manually and set the `ZEROEX_DATA_PIPELINE_TEST_DB_URL` environment variable. If this is set, the tests will use your existing Postgres database instead of trying to create one with Docker.
## Running locally

`pipeline` requires access to a PostgreSQL database. The easiest way to start Postgres is via Docker. Depending on your platform, you may need to prepend `sudo` to the following command:

```bash
docker run --rm -d -p 5432:5432 --name pipeline_postgres postgres:11-alpine
```

This will start a Postgres server with the default username and database name (`postgres` and `postgres`). You should set the environment variable as follows:

```bash
export ZEROEX_DATA_PIPELINE_DB_URL=postgresql://postgres@localhost/postgres
```

The first thing you will need to do is run the migrations:

```bash
yarn migrate:run
```

Now you can run scripts locally:

```bash
node packages/pipeline/lib/src/scripts/pull_radar_relay_orders.js
```

To stop the Postgres server (you may need to prepend `sudo`):

```bash
docker stop pipeline_postgres
```

This will remove all data from the database.
If you prefer, you can also install Postgres directly with e.g., Homebrew or Postgres.app. Keep in mind that you will need to set the `ZEROEX_DATA_PIPELINE_DB_URL` environment variable to a valid PostgreSQL connection URL.
## Directory structure

```
.
├── lib: Code generated by the TypeScript compiler. Don't edit this directly.
├── migrations: Code for creating and updating database schemas.
├── node_modules:
├── src: All TypeScript source code.
│   ├── data_sources: Code responsible for getting raw data, typically from a third-party source.
│   ├── entities: TypeORM entities which closely mirror our database schemas. Some other ORMs call these "models".
│   ├── parsers: Code for converting raw data into entities.
│   ├── scripts: Executable scripts which put all the pieces together.
│   └── utils: Various utils used across packages/files.
└── test: All tests go here and are organized in the same way as the folder/file that they test.
```
## Adding new data to the pipeline

1. Create an entity in the `entities` directory. Entities directly mirror our database schemas. We follow the practice of having "dumb" entities, so entity classes should typically not have any methods.
2. Create a migration using the `yarn migrate:create` command. Create/update tables as needed. Remember to fill in both the `up` and `down` methods. Try to avoid data loss as much as possible in your migrations.
3. Add basic tests for your entity and migrations to the `test/entities/` directory.
4. Create a class or function in the `data_sources/` directory for getting raw data. This code should abstract away pagination and rate-limiting as much as possible.
5. Create a class or function in the `parsers/` directory for converting the raw data into an entity. Also add tests in the `tests/` directory to test the parser.
6. Create an executable script in the `scripts/` directory for putting everything together. Your script can accept environment variables for things like API keys. It should pull the data, parse it, and save it to the database. Scripts should be idempotent and atomic (when possible). This means your script may be responsible for determining which data needs to be updated. For example, you may need to query the database to find the most recent block number that we have already pulled, then pull new data starting from that block number.
7. Run the migrations and then run your new script locally and verify it works as expected.
8. After all tests pass and you can run the script locally, open a new PR to the monorepo. Don't merge this yet!
9. If you added any new scripts or dependencies between scripts, you will need to make changes to https://github.com/0xProject/0x-pipeline-orchestration and make a separate PR there. Don't merge this yet!
10. After your PR passes code review, ask @feuGeneA or @xianny to deploy your changes to the QA environment. Check the QA Airflow dashboard to make sure everything works correctly in the QA environment.
11. Merge your PR to 0x-monorepo (and https://github.com/0xProject/0x-pipeline-orchestration if needed). Then ask @feuGeneA or @xianny to deploy to production.
12. Monitor the production Airflow dashboard to make sure everything still works.
13. Celebrate! 🎉
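To make the entity and parser steps concrete, here is a minimal sketch of a "dumb" entity and its parser. The `ExampleEvent` entity, the `RawExampleEvent` payload shape, and all field names are hypothetical; real entities in `src/entities` additionally carry TypeORM decorators, omitted here so the snippet runs without the `typeorm` package installed.

```typescript
// A hypothetical "dumb" entity mirroring a raw.example_events table: fields
// only, no methods. Real entities also use TypeORM's @Entity/@Column decorators.
class ExampleEvent {
    public blockNumber!: number;
    public logIndex!: number;
    public makerAssetType!: string;
}

// Hypothetical raw payload shape as returned by a third-party API,
// with snake_cased keys matching the database column naming convention.
interface RawExampleEvent {
    block_number: number;
    log_index: number;
    maker_asset_type: string;
}

// The parser converts raw snake_cased data into a camel-cased entity,
// performing as little transformation as possible.
function parseExampleEvent(raw: RawExampleEvent): ExampleEvent {
    const event = new ExampleEvent();
    event.blockNumber = raw.block_number;
    event.logIndex = raw.log_index;
    event.makerAssetType = raw.maker_asset_type;
    return event;
}
```

Keeping the parser a pure function makes the corresponding tests in `test/parsers/` trivial to write.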
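The idempotency pattern described for scripts can be sketched as follows; the `EventStore` interface and `pullNewEvents` helper are hypothetical stand-ins for a TypeORM repository and a data-source client, not APIs from this package.

```typescript
interface EventRecord {
    blockNumber: number;
}

// Hypothetical persistence interface; in practice this would be a TypeORM repository.
interface EventStore {
    getMaxBlockNumber(): Promise<number | null>;
    save(events: EventRecord[]): Promise<void>;
}

// Resume from the last block already saved so that re-running the script
// never duplicates data (idempotent) and always starts where it left off.
async function pullNewEvents(
    store: EventStore,
    fetchSince: (blockNumber: number) => Promise<EventRecord[]>,
    genesisBlock: number,
): Promise<number> {
    const maxBlock = await store.getMaxBlockNumber();
    const startBlock = maxBlock === null ? genesisBlock : maxBlock + 1;
    const newEvents = await fetchSince(startBlock);
    await store.save(newEvents);
    return newEvents.length;
}
```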
Additional guidelines and tips:

- Table names should be plural with words separated by underscores (e.g., `exchange_fill_events`).
- Any table which contains data which comes directly from a third-party source should be namespaced in the `raw` PostgreSQL schema.
- Column names in the database should be separated by underscores (e.g., `maker_asset_type`).
- Field names in entity classes (like any other fields in TypeScript) should be camel-cased (e.g., `makerAssetType`).
- All timestamps should be stored as milliseconds since the Unix Epoch.
- Use the `BigNumber` type for TypeScript code which deals with 256-bit numbers from smart contracts or for any case where we are dealing with large floating point numbers.
- TypeORM documentation is pretty robust and can be a helpful resource.
- Scripts/parsers should perform minimal data transformation/normalization. The idea here is to have a raw data feed that will be cleaned up and synthesized in a separate step.
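For example, the milliseconds-since-Epoch convention usually means converting on ingestion, since sources commonly report seconds (Ethereum block timestamps) or ISO 8601 strings (REST APIs). The helper names below are illustrative, not part of the package:

```typescript
// Ethereum block timestamps are in seconds; our tables store milliseconds.
function secondsToMs(unixSeconds: number): number {
    return unixSeconds * 1000;
}

// Date.parse already returns milliseconds since the Unix Epoch,
// so ISO 8601 strings need no further scaling.
function isoToMs(iso: string): number {
    return Date.parse(iso);
}
```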