## @0xproject/pipeline

This repository contains scripts used for scraping data from the Ethereum blockchain into SQL tables for analysis by the 0x team.

## Contributing

We strongly recommend that the community help us make improvements and determine the future direction of the protocol. To report bugs within this package, please create an issue in this repository.

Please read our [contribution guidelines](../../CONTRIBUTING.md) before getting started.

### Install dependencies

```bash
yarn install
```

### Build

```bash
yarn build
```

### Clean

```bash
yarn clean
```

### Lint

```bash
yarn lint
```

### Migrations

- Create a new migration: `yarn migrate:create --name MigrationNameInCamelCase`
- Run migrations: `yarn migrate:run`
- Revert the most recent migration (CAUTION: may result in data loss!): `yarn migrate:revert`

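For example, a typical workflow might look like the following (the migration name below is just an illustration):

```bash
# Generate a new, empty migration file; fill in its `up` and `down` methods.
yarn migrate:create --name AddExampleEventsTable

# Apply any migrations that have not yet been run against the database.
yarn migrate:run
```
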
## Testing

There are several test scripts in **package.json**. You can run all the tests
with `yarn test:all` or run certain tests separately by following the
instructions below. Some tests may not work out of the box on certain platforms
or operating systems (see the "Database tests" section below).

### Unit tests

The unit tests can be run with `yarn test`. These tests don't depend on any
services or databases and will run in any environment that can run Node.

### Database tests

Database integration tests can be run with `yarn test:db`. These tests will
attempt to automatically spin up a Postgres database via Docker. If this doesn't
work, you have two other options (see the example below):

1. Set the `DOCKER_SOCKET` environment variable to a valid socket path to use
   for communicating with Docker.
2. Start Postgres manually and set the `ZEROEX_DATA_PIPELINE_TEST_DB_URL`
   environment variable. If this is set, the tests will use your existing
   Postgres database instead of trying to create one with Docker.

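For example, either option might be configured like this (the socket path, username, and database name are placeholders for your own setup):

```bash
# Option 1: tell the tests which Docker socket to use (this path is an example).
export DOCKER_SOCKET=/var/run/docker.sock

# Option 2: use an existing Postgres instance instead of Docker.
# The username and database name here are placeholders.
export ZEROEX_DATA_PIPELINE_TEST_DB_URL=postgresql://postgres@localhost/pipeline_test

yarn test:db
```
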
## Running locally

`pipeline` requires access to a PostgreSQL database. The easiest way to start
Postgres is via Docker. Depending on your platform, you may need to prepend
`sudo` to the following command:

```bash
docker run --rm -d -p 5432:5432 --name pipeline_postgres postgres:11-alpine
```

This will start a Postgres server with the default username and database name
(`postgres` and `postgres`). You should set the environment variable as follows:

```bash
export ZEROEX_DATA_PIPELINE_DB_URL=postgresql://postgres@localhost/postgres
```

The first thing you will need to do is run the migrations:

```bash
yarn migrate:run
```

Now you can run scripts locally:

```bash
node packages/pipeline/lib/src/scripts/pull_radar_relay_orders.js
```

To stop the Postgres server (you may need to add `sudo`):

```bash
docker stop pipeline_postgres
```

This will remove all data from the database.

If you prefer, you can also install Postgres directly, e.g., via
[Homebrew](https://wiki.postgresql.org/wiki/Homebrew) or
[Postgres.app](https://postgresapp.com/). Keep in mind that you will need to
set the `ZEROEX_DATA_PIPELINE_DB_URL` environment variable to a valid
[PostgreSQL connection URL](https://stackoverflow.com/questions/3582552/postgresql-connection-url).

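If you are not using the Docker defaults, the URL generally has the form `postgresql://username:password@host:port/database`. For example (all values below are placeholders):

```bash
# Placeholder credentials, port, and database name; replace them with your own.
export ZEROEX_DATA_PIPELINE_DB_URL=postgresql://username:password@localhost:5432/zeroex_pipeline
```
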
## Directory structure

```
.
├── lib: Code generated by the TypeScript compiler. Don't edit this directly.
├── migrations: Code for creating and updating database schemas.
├── node_modules: Installed dependencies.
├── src: All TypeScript source code.
│   ├── data_sources: Code responsible for getting raw data, typically from a third-party source.
│   ├── entities: TypeORM entities which closely mirror our database schemas. Some other ORMs call these "models".
│   ├── parsers: Code for converting raw data into entities.
│   ├── scripts: Executable scripts which put all the pieces together.
│   └── utils: Various utils used across packages/files.
└── test: All tests go here and are organized in the same way as the folder/file that they test.
```

## Adding new data to the pipeline

1. Create an entity in the _entities_ directory. Entities directly mirror our
   database schemas. We follow the practice of having "dumb" entities, so
   entity classes should typically not have any methods. (A rough sketch of an
   entity and its parser is shown after this list.)
2. Create a migration using the `yarn migrate:create` command. Create/update
   tables as needed. Remember to fill in both the `up` and `down` methods. Try
   to avoid data loss as much as possible in your migrations.
3. Add basic tests for your entity and migrations to the **test/entities/**
   directory.
4. Create a class or function in the **data_sources/** directory for getting
   raw data. This code should abstract away pagination and rate-limiting as
   much as possible.
5. Create a class or function in the **parsers/** directory for converting the
   raw data into an entity. Also add tests in the **tests/** directory to test
   the parser.
6. Create an executable script in the **scripts/** directory for putting
   everything together. Your script can accept environment variables for things
   like API keys. It should pull the data, parse it, and save it to the
   database. Scripts should be idempotent and atomic (when possible). This
   means your script may be responsible for determining _which_ data needs to
   be updated. For example, you may need to query the database to find the most
   recent block number that we have already pulled, then pull new data starting
   from that block number.
7. Run the migrations and then run your new script locally and verify it works
   as expected.
8. After all tests pass and you can run the script locally, open a new PR to
   the monorepo. Don't merge this yet!
9. If you added any new scripts or dependencies between scripts, you will need
   to make changes to https://github.com/0xProject/0x-pipeline-orchestration
   and make a separate PR there. Don't merge this yet!
10. After your PR passes code review, ask @feuGeneA or @xianny to deploy your
    changes to the QA environment. Check the [QA Airflow dashboard](http://airflow-qa.0x.org:8080)
    to make sure everything works correctly in the QA environment.
11. Merge your PR to 0x-monorepo (and
    https://github.com/0xProject/0x-pipeline-orchestration if needed). Then ask
    @feuGeneA or @xianny to deploy to production.
12. Monitor the [production Airflow dashboard](http://airflow.0x.org:8080) to
    make sure everything still works.
13. Celebrate! :tada:

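To make steps 1 and 5 more concrete, here is a rough sketch of what an entity and its parser might look like. The table, columns, raw-data shape, and file paths in the comments are invented for illustration, and the decorator options assume TypeORM's standard `Entity`/`Column`/`PrimaryColumn` API; the existing code in **src/entities/** and **src/parsers/** is the authoritative reference.

```typescript
// Hypothetical entity (would live in src/entities/): fields only, no methods.
import { Column, Entity, PrimaryColumn } from 'typeorm';

// Table names are plural snake_case; third-party data goes in the `raw` schema.
@Entity({ name: 'example_fill_events', schema: 'raw' })
export class ExampleFillEvent {
    // Fields are camelCase in TypeScript, snake_case in the database.
    @PrimaryColumn({ name: 'transaction_hash' })
    public transactionHash!: string;

    @Column({ name: 'block_number', type: 'bigint' })
    public blockNumber!: number;

    // Timestamps are stored as milliseconds since the Unix Epoch.
    @Column({ name: 'observed_timestamp', type: 'bigint' })
    public observedTimestamp!: number;
}

// Hypothetical raw payload as returned by a data source in src/data_sources/.
interface RawFillEvent {
    txHash: string;
    blockNumber: string;
}

// Hypothetical parser (would live in src/parsers/): minimal transformation,
// just converting the raw payload into an entity instance.
export function parseFillEvent(raw: RawFillEvent, observedTimestamp: number): ExampleFillEvent {
    const entity = new ExampleFillEvent();
    entity.transactionHash = raw.txHash;
    entity.blockNumber = Number(raw.blockNumber);
    entity.observedTimestamp = observedTimestamp;
    return entity;
}
```

A script in **scripts/** would then tie a data source and this parser together and save the resulting entities to the database.
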
#### Additional guidelines and tips

- Table names should be plural and separated by underscores (e.g.,
  `exchange_fill_events`).
- Any table which contains data which comes directly from a third-party source
  should be namespaced in the `raw` PostgreSQL schema.
- Column names in the database should be separated by underscores (e.g.,
  `maker_asset_type`).
- Field names in entity classes (like any other fields in TypeScript) should
  be camel-cased (e.g., `makerAssetType`).
- All timestamps should be stored as milliseconds since the Unix Epoch.
- Use the `BigNumber` type for TypeScript code which deals with 256-bit
  numbers from smart contracts or for any case where we are dealing with large
  floating point numbers.
- [TypeORM documentation](http://typeorm.io/#/) is pretty robust and can be a
  helpful resource.
- Scripts/parsers should perform minimal data transformation/normalization.
  The idea here is to have a raw data feed that will be cleaned up and
  synthesized in a separate step.

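As one possible way to follow the `BigNumber` guideline, 256-bit values can be stored in Postgres `numeric` columns and converted at the ORM boundary. The transformer below is only a sketch; it assumes `BigNumber` from `bignumber.js` and TypeORM's `ValueTransformer` interface.

```typescript
import { BigNumber } from 'bignumber.js';
import { ValueTransformer } from 'typeorm';

// Hypothetical transformer: the database stores the value as a numeric string,
// while TypeScript code works with BigNumber instances.
export const bigNumberTransformer: ValueTransformer = {
    from: (value: string | null): BigNumber | null => (value === null ? null : new BigNumber(value)),
    to: (value: BigNumber | null): string | null => (value === null ? null : value.toString()),
};

// Example usage inside an entity class:
// @Column({ name: 'maker_asset_amount', type: 'numeric', transformer: bigNumberTransformer })
// public makerAssetAmount!: BigNumber;
```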