Merge branch 'master' into feat/custom-moldbs
# Conflicts:
#	metaspace/graphql/src/modules/dataset/controller/Mutation.publishing.spec.ts
#	metaspace/graphql/src/modules/dataset/controller/Mutation.ts
#	metaspace/graphql/src/tests/testDataCreation.ts
#	metaspace/graphql/yarn.lock
#	metaspace/webapp/src/router.ts
#	metaspace/webapp/yarn.lock
intsco committed Jun 5, 2020
2 parents fa38768 + e0c4ba5 commit 6049cc1
Showing 161 changed files with 8,081 additions and 6,794 deletions.
2 changes: 1 addition & 1 deletion .circleci/config.yml
@@ -42,7 +42,7 @@ jobs:
- run:
name: Upload coverage
command: |
-yarn run test --coverage
+yarn run test --coverage --maxWorkers=1
npx codecov -p ../.. -F graphql
3 changes: 2 additions & 1 deletion ansible/README.md
@@ -1,8 +1,9 @@
# Ansible project for setting up machines running SM platform

## Installation Types Supported
-* [Virtual Box Installation](vbox/README.md)
+* ~[Vagrant virtualbox Installation](vbox/README.md)~ (Removed in favor of Docker installations)
* [AWS Installation](aws/README.md)
* [Docker installation](../docker/README.md)

## Funding

284 changes: 284 additions & 0 deletions ansible/aws/PRODUCTION_UPGRADE.md
@@ -0,0 +1,284 @@
# Review changes since the last deployment

https://github.com/metaspace2020/metaspace/compare/release...master

Review all outstanding changes. If you're unfamiliar with any of them,
ask the author whether any manual steps are needed during deployment.

### Main places to check for changes:

#### `metaspace/engine/sm/engine/es_export.py`

If new fields are added to ElasticSearch:
* It will be necessary to manually update the ElasticSearch index during deployment
* Ensure that sm-graphql has fallback logic for when the new fields aren't yet populated.
Note that if graphql returns `null`/`undefined` for a non-nullable field, the whole query will fail.
This can easily break the Datasets or Annotations pages.

Also note that this behavior may be hidden during local development: the `graphqlMocks` feature flag
replaces `undefined` return values with random test data.

If any of the existing field mappings (defined in `ESIndexManager.create_index`) are changed,
it will be necessary to do a full rebuild of ElasticSearch. Try to avoid this, as it costs a lot of time.
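
If you're unsure whether mappings changed, you can inspect the live mappings and compare them
against the code. A sketch, assuming ElasticSearch listens on the default port 9200; the index
name `sm` is a placeholder, so check `manage_es_index status` for the real one:

```bash
# Dump the live field mappings for comparison with ESIndexManager.create_index.
# "sm" is a placeholder index name; find the actual name with:
#   python -m scripts.manage_es_index status
curl -s 'http://localhost:9200/sm/_mapping?pretty' | less
```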

#### `metaspace/graphql/src/migrations/`

Usually these are run automatically when the sm-graphql service restarts. Just be aware when they exist.
They don't always succeed, and they occasionally need to be monkey-patched to fix a deployment.

#### `metaspace/engine/migrations/`

These migrations must be run manually. Check with the author for how to run them.

#### Config file and Ansible deployment script changes

It's always hard to know if these changes will deploy safely. Review them before deploying so you know
where to start looking if something goes wrong.

#### Test status in the `master` branch

Make sure `webapp` and `graphql` builds are passing in the `master` branch in CircleCI.
It's possible for PRs that pass all tests to break the build after merging,
e.g. if a function the PR depends on is renamed after the PR branches from master.

# Choose a deployment strategy

Select one of the following based on whether the new code is compatible with the existing data.
There should be no more than one minute of downtime without at least a visible maintenance message.

Copy these checklists into a new task if desired, or if any customization of the process is needed.

#### In-place deployment

If there are no significant changes to ElasticSearch or Postgres:

- [ ] Let the #metaspace_dev slack channel know you're starting deployment.
- [ ] Run [the Ansible web deployment](README.md).
- [ ] Let the #metaspace_dev slack channel know that deployment was successful.
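
For reference, the web deployment is a single playbook run. A sketch, assuming the production
inventory lives at `env/prod`; see [the Ansible web deployment](README.md) for the authoritative command:

```bash
# Assumed inventory path; the playbook name matches the aws README
ansible-playbook -i env/prod deploy/web.yml
```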

#### Deployment followed by ElasticSearch update

If there are new fields in ElasticSearch, but it's ok for them to be populated over the course of several days:

- [ ] Let the #metaspace_dev slack channel know you're starting deployment.
- [ ] Run [the Ansible web deployment](README.md).
- [ ] Run an [ElasticSearch incremental update](#es-update).
- [ ] Let the #metaspace_dev slack channel know that deployment was successful once the ElasticSearch update is running.

#### ElasticSearch reindex before deployment

If there are new fields in ElasticSearch that are necessary for the new code:

- [ ] Check out the new code into a temp directory on the server.
- [ ] Create an inactive ElasticSearch index.
- [ ] Use the new code to reindex into the inactive index. This can take multiple days.
- [ ] Turn off dataset processing in https://metaspace2020.eu/admin/health
- [ ] Run a partial ElasticSearch update in the inactive index for any datasets that were created while indexing.
This is just to prevent users from wondering "Where is my data?" for recently submitted datasets.
- [ ] Let the #metaspace_dev slack channel know you're starting deployment.
- [ ] Swap the inactive index with the active index.
- [ ] Deploy the new code.
- [ ] Turn dataset reprocessing back on.
- [ ] Run a full incremental update just in case an old dataset was updated and its changes weren't
propagated to the new ElasticSearch index.
- [ ] Let the #metaspace_dev slack channel know that deployment was successful once the ElasticSearch update is running.
- [ ] Delete the old index (now the inactive index).
- [ ] Delete the temp directory containing the new code.

#### Fork the VM, update, then swap to the new VM

If there are DB or infrastructure changes that would require substantial downtime:

- [ ] Let the #metaspace_dev slack channel know you're starting deployment.
- [ ] Turn METASPACE to read-only mode in https://metaspace2020.eu/admin/health
- [ ] Use AWS to snapshot the EC2 instance, then create a new instance from the snapshot.
* Copy all the properties from the previous instance, and make sure Termination Protection is turned on.
- [ ] Update your Ansible `/env/prod/hosts` file to link to the IP address of the new instance.
- [ ] Deploy to the new VM and apply the migrations.
- [ ] Swap the Elastic IP address for metaspace2020.eu to point to the new VM.
- [ ] Confirm everything is working on the new instance, then turn off read-only mode.
- [ ] Shut down the old instance.
- [ ] Let the #metaspace_dev slack channel know that deployment was successful.
- [ ] Terminate the old instance once you're happy that the migration has succeeded.
- [ ] Revert your Ansible `/env/prod/hosts` change, as the new instance now has the old instance's public IP address.
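
The snapshot step can also be done with the AWS CLI instead of the Console. A sketch; the
instance id, tag filter, and image name are all hypothetical:

```bash
# Find the instance id (the tag filter is hypothetical)
aws ec2 describe-instances \
  --filters 'Name=tag:Name,Values=*metaspace*' \
  --query 'Reservations[].Instances[].[InstanceId,State.Name]' --output table

# Create an AMI from the instance. --no-reboot avoids downtime but risks
# filesystem inconsistency; omit it if a brief reboot is acceptable.
aws ec2 create-image --instance-id i-0123456789abcdef0 \
  --name 'metaspace-prod-pre-upgrade' --no-reboot
```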


# ElasticSearch Commands

Prerequisite environment setup:
```bash
ssh ubuntu@metaspace2020.eu # Connect to the server
cd /opt/dev/metaspace/metaspace/engine
source activate sm # Activate the sm Python environment
# This will probably warn about "/usr/local/bin/deactivate". Ignore it.
```
## Managing the indexes

#### Check status of indexes

```bash
python -m scripts.manage_es_index status
```

#### Create inactive index

```bash
python -m scripts.manage_es_index --inactive create
```

#### Swap inactive and active indexes

```bash
python -m scripts.manage_es_index swap
```

#### Drop inactive index

Always use `status` to check that the index to drop is `inactive` before running this.

```bash
python -m scripts.manage_es_index --inactive drop
```

## Reindexing

NOTE: The default SSH configuration will lose connection to the server after a period of idleness.
This can cause these long-running jobs to be terminated. It's a good idea to run all of these commands from
within a `tmux` shell, so that they continue running after a disconnection, and can be re-opened after reconnection.

[Tmux Cheat Sheet](https://tmuxcheatsheet.com/)

If you enter "copy mode" by scrolling or selecting text, make sure to exit copy mode afterwards,
because the running process will be stalled during copy mode.
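
A typical session looks like this (the session name `reindex` is arbitrary):

```bash
tmux new -s reindex     # start a named session, then run the long job inside it
# Detach with Ctrl-b d; the job keeps running after you disconnect.
tmux ls                 # list sessions after reconnecting via SSH
tmux attach -t reindex  # re-attach to the running session
```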

#### <a name="es-update"></a>Full in-place update

This will update the existing documents in-place.

```bash
nice python -m scripts.update_es_index --ds-name "%' ORDER BY id DESC; --"
# (Yes, this uses SQL injection... It's just easier to work with SQL than making 10s of command line options)
```
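
To see why this works: the `--ds-name` value is presumably interpolated into a SQL `LIKE` clause,
so the payload closes the string, appends an `ORDER BY`, and comments out the trailing quote.
The table and column names below are illustrative, not the engine's actual schema:

```bash
# Illustrative only: assuming the script builds something like
#   ... WHERE name LIKE '<ds-name>'
DS_NAME="%' ORDER BY id DESC; --"
echo "SELECT * FROM dataset WHERE name LIKE '${DS_NAME}'"
# → SELECT * FROM dataset WHERE name LIKE '%' ORDER BY id DESC; --'
```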

#### <a name="es-recent"></a>Partial in-place update for recent datasets

Change the date in the query to the desired earliest date to update.

```bash
nice python -m scripts.update_es_index --ds-name "%' AND id > '2020-05-18'; --"
```

#### <a name="es-reindex"></a>Offline reindex

Run each line individually & check the results.

```bash
# Check existing indexes
python -m scripts.manage_es_index status

# If there's an inactive index, drop it
python -m scripts.manage_es_index --inactive drop

# Create inactive index
python -m scripts.manage_es_index --inactive create

# Populate inactive index (this can take several days)
nice python -m scripts.update_es_index --inactive --ds-name "%' ORDER BY id DESC; --"
# Disable dataset processing once this is done

# Add datasets that were created after the reindexing started (Change the date to when you started this process)
nice python -m scripts.update_es_index --inactive --ds-name "%' AND id > '2020-05-18'; --"

# Swap inactive and active indexes
python -m scripts.manage_es_index swap

# Deploy new code & check that it's working
# Re-enable dataset processing

# Do a full incremental index update, in case any datasets were missed
nice python -m scripts.update_es_index --ds-name "%' ORDER BY id DESC; --"

# Once you're satisfied it's safe, drop the old index
python -m scripts.manage_es_index --inactive drop
```

# Post-deployment checks

* Check that https://metaspace2020.eu/datasets and https://metaspace2020.eu/annotations work and show data.

* If there have been any changes to the annotation code or cluster configuration,
submit a [test dataset](metaspace/engine/tests/data/untreated) to check that annotation still works correctly.

# Troubleshooting

## Manage services with Supervisor

```bash
supervisorctl status
```
> ```
> sm-api RUNNING pid 23950, uptime 26 days, 23:39:14
> sm-cluster-autostart RUNNING pid 26146, uptime 26 days, 2:21:22
> sm-graphql RUNNING pid 20267, uptime 6 days, 0:48:08
> sm-update-daemon RUNNING pid 23947, uptime 26 days, 23:39:14
> ```
If any service isn't `RUNNING`, first try restarting it, e.g. for sm-graphql:
```bash
supervisorctl restart sm-graphql
```

Check the logs if it won't stay running:
```bash
supervisorctl tail -10000 sm-graphql
```

Alternatively you can use `less` to browse the logs on the filesystem:

```bash
less /opt/dev/metaspace/metaspace/graphql/logs/sm-graphql.log
less /opt/dev/metaspace/metaspace/engine/logs/sm-api.log
less /opt/dev/metaspace/metaspace/engine/logs/sm-update-daemon.log
```

## Postgres

If you need to make manual database fixes, either use a dedicated database client (e.g. DataGrip), or the command line.
It's really easy to destroy data this way, so don't do this unless you've made an AWS snapshot of the VM,
or you're confident in your SQL skills.

You can start an SQL prompt with `sudo -u postgres psql sm postgres`
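
For one-off read-only checks you can pass a command directly. A sketch; only the harmless `\dt`
meta-command is shown, since any real fix depends on the schema:

```bash
# List tables without opening an interactive prompt (read-only)
sudo -u postgres psql sm postgres -c '\dt'
```

For anything destructive, open an interactive session, start with `BEGIN;`, run your statements,
check the affected row counts, and finish with `COMMIT;` (or `ROLLBACK;` if anything looks wrong).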

## System services

Check statuses & recent logs:
```bash
sudo systemctl status
sudo systemctl status nginx
sudo systemctl status postgresql
sudo systemctl status elasticsearch
```
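
When a unit is failing, journald usually has more context than `systemctl status`:

```bash
sudo journalctl -u nginx -n 100 --no-pager        # last 100 log lines
sudo journalctl -u elasticsearch --since '1 hour ago'
```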

Reload nginx config:
```bash
sudo nginx -s reload
```

Restart services:
```bash
sudo systemctl restart nginx
sudo systemctl restart postgresql
sudo systemctl restart elasticsearch
```

#### Logs

Most logs are in `/var/log` but are in protected directories.
You may wish to `sudo su` so that you can more easily browse the filesystem. Don't forget to `exit` superuser mode
once you're done.

```bash
sudo tail /var/log/nginx/error.log
sudo tail /var/log/elasticsearch/elasticsearch.log
sudo tail /var/log/postgresql/postgresql-9.5-main.log
```
22 changes: 11 additions & 11 deletions ansible/aws/README.md
@@ -47,7 +47,7 @@ Specify
* Passwords
* `hostgroup` values for all types of instances

-in dev/group_vars/all/vars.yml and dev/group_vars/all/vault.yml
+in env/dev/group_vars/all/vars.yml and env/dev/group_vars/all/vault.yml

Values to be updated are capitalized.

@@ -56,20 +56,20 @@ Values to be updated are capitalized.
You will need at least three instances: a main instance for the web app, database, RabbitMQ, and Elasticsearch;
a Spark master instance; and a Spark slave instance.

-`ansible-playbook -i dev aws_start.yml -e "components=all"`
+`ansible-playbook -i env/dev aws_start.yml -e "components=all"`

#### Provision instances

Provision the web services and Spark cluster instances

```
-ansible-playbook -i dev provision/web_server.yml
-ansible-playbook -i dev provision/spark_cluster.yml
+ansible-playbook -i env/dev provision/web.yml
+ansible-playbook -i env/dev provision/spark.yml
```

#### Create custom AMIs for Spark master and slave instances

-`ansible-playbook -i dev create_ami.yml`
+`ansible-playbook -i env/dev create_ami.yml`

This step will take a while.
Once the playbook has finished, replace the AMI ids for the master and slave instances with the new ones in dev/group_vars/all/vars.yml.
@@ -80,25 +80,25 @@ New AMI ids can be found in the AWS Console.
After the AMIs have been successfully created, stop the Spark instances.
They will be started automatically from the new AMIs after a dataset is uploaded.

-`ansible-playbook -i dev aws_stop.yml -e "components=master,slave"`
+`ansible-playbook -i env/dev aws_stop.yml -e "components=master,slave"`

#### Deploy and start the web app and other services

-`ansible-playbook -i dev deploy/web_server.yml`
+`ansible-playbook -i env/dev deploy/web.yml`

## Start/Stop instances manually

Start all instances

```
-ansible-playbook -i dev aws_start.yml -e "components=all"
-ansible-playbook -i dev aws_cluster_setup.yml
+ansible-playbook -i env/dev aws_start.yml -e "components=all"
+ansible-playbook -i env/dev aws_cluster_setup.yml
```

Deploy and start the web application and other services

-`ansible-playbook -i dev deploy/web_server.yml`
+`ansible-playbook -i env/dev deploy/web.yml`

To stop all SM platform instances, execute

-`ansible-playbook -i dev aws_stop.yml -e "components=all"`
+`ansible-playbook -i env/dev aws_stop.yml -e "components=all"`