# OpenStack

## Context
* 25 OpenStack regions worldwide
* \>250k instances
* oldest regions are using OpenStack Juno
* newest regions are using OpenStack Newton
### When do we need live migration?
* Need to empty a compute node
  * Hardware maintenance
  * Software issue
  * Kernel upgrade
  * Microcode update
    * Spectre / Meltdown / Foreshadow
* Upgrade from Juno to Newton
  * keep consistency between infra
  * drop old releases
  * bring new services
  * Newton => easier to upgrade to newer releases
#### More live-migration => more bugs
## Context
* Environment
  - Juno on Ubuntu Trusty
  - Qemu 2.3 / libvirt 1.3.1

**We need an outstanding live-migration workflow!**
But that's not the case on vanilla OpenStack Juno...
## Multiple instance profiles
* Why our live-migration is painful
  * local disk
  * ceph disk
* Extra
  * cinder volume
  * config drive
  * GPU
  * FPGA
  * different CPU generations
## Workflow
* A bug is triggered by live migration
  - Try to reproduce it in an OpenStack dev environment
  - Find the root cause
  - Check if an upstream fix already exists
    - Cherry-pick it onto our OpenStack environment
  - If not fixed upstream, it needs to be fixed quickly
    - local workaround
    - OpenStack upstream patch
### Live migration issues
* Live migration of loaded instances fails
  - the migration takes ages or never finishes
  - the instance needs to be paused (virsh suspend)
  - reproduce case: stress-ng running in the VM (see the sketch after the domjobinfo output)
```bash
# source host
$ virsh domjobinfo 9c2d1068-fe0b-4ee3-b860-c00c9688cf49
Job type: Unbounded
Time elapsed: 172795 ms
# Destination host, after instance suspend
$ virsh domjobinfo 9c2d1068-fe0b-4ee3-b860-c00c9688cf49 --completed
Job type: Completed
Time elapsed: 942 798 ms
Data processed: 572.901 GiB
Iteration: 224
...
```
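A simple way to generate that kind of load for testing; the stress-ng parameters below are illustrative, not the exact ones we use:
```bash
# Inside the guest: keep dirtying memory faster than the migration can copy it.
# Worker count and size are illustrative; scale them to the flavor's RAM.
stress-ng --vm 4 --vm-bytes 2G --vm-keep --timeout 30m
```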
### Live migration issues
* Enable auto convergence
  - CPU throttling

=> **Need qemu 2.5 / libvirt 1.3.3**
```bash
# /etc/nova/nova.conf
# Juno
block_migration_flag=..., VIR_MIGRATE_AUTO_CONVERGE
live_migration_flag=..., VIR_MIGRATE_AUTO_CONVERGE
# Newton
live_migration_permit_auto_converge=True
```
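For reference, auto convergence can also be exercised outside of Nova with plain virsh; the destination URI below is a placeholder:
```bash
# Manual live migration with auto convergence, bypassing Nova
# (qemu+tcp://dest-host/system is a placeholder destination URI).
virsh migrate --live --auto-converge --verbose \
      9c2d1068-fe0b-4ee3-b860-c00c9688cf49 qemu+tcp://dest-host/system
```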
### Live migration issues
Juno uses the libvirt method migrateToURI2
- We were unable to live migrate instances with both:
  - a local disk
  - an attached ceph volume
* Use migrateToURI3 instead

=> **Need qemu 2.5 / libvirt 1.3.3**
### Live migration issues
Config drive
```bash
$ nova boot --config-drive True ...
```
On the compute host:
```bash
$ ls
console.log disk disk.config disk.info libvirt.xml
```
Bug in libvirt: with `config_drive_format=iso9660`, the config drive is a read-only file
Solution: copy disk.config from the source host ourselves
```python
# Local workaround (simplified): pull the config drive over scp
# from the source compute host onto the destination.
src = "%s:%s/disk.config" % (instance['host'], instance_dir)
utils.execute('scp', src, instance_dir)
```
### Live migration issues
- Future improvements:
  - OpenStack
    - live migration with CPU pinning
    - migrate from a node with a qemu qcow2 disk => raw disk
  - libvirt migration features:
    - post copy (>= qemu 2.6) (see the sketch below)
    - compression
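As a sketch, enabling post copy on Newton should only need the corresponding nova.conf flag (assuming qemu >= 2.6 / libvirt >= 1.3.3 on both hosts); it is listed here as a future improvement, not something we run today:
```bash
# /etc/nova/nova.conf (Newton) - post copy, still a future improvement for us
live_migration_permit_post_copy=True
```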
## Automation - run-cli

Note: is-trusty kernel-version
## Automation - mistral

## Automation - mistral
* Workflows
  * inputs
  * tasks
    * actions
    * workflows
  * outputs
## Automation - example workflow
```yaml
live-migrate:
  description: A workflow that live migrates an instance
  input:
    - region
    - vm
```
## Automation - example workflow
```yaml
...
ping_before:
  action: ovh.shell
  input:
    cmd: "ping -c10 <% $.ip_to_ping %>"
  # we will go to the next action depending on
  # whether or not the instance pings
  publish:
    pingable: true
  publish-on-error:
    pingable: false
  on-complete:
    - live_migrate
...
```
## Automation - example workflow
```yaml
live_migrate:
  action: nova.servers_live_migrate
  input:
    action_region: <% $.region %>
    server: <% $.vm.id %>
    host: Null
    block_migration: false  # We first try without block_migration
    disk_over_commit: true
  retry:
    delay: 5
    count: 1
  wait-after: 10
  on-success:
    - wait_vm_active
  on-error:
    - live_block_migrate  # On error, we will try with block_migration
```
## Automation - example workflow
```yaml
live_block_migrate:
  action: nova.servers_live_migrate
  input:
    action_region: <% $.region %>
    server: <% $.vm.id %>
    host: Null
    block_migration: true  # This is block_migration
    disk_over_commit: true
  retry:
    delay: 5
    count: 1
  on-success:
    - wait_vm_active
  on-error:
    - fail  # It also failed with block_migration, so fail
```
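To give an idea of how the workflow is triggered, a hedged example with the Mistral CLI; the region and instance id below are made up:
```bash
# Start the live-migrate workflow for a single instance
# (region name and instance id are placeholders).
mistral execution-create live-migrate \
        '{"region": "GRA1", "vm": {"id": "9c2d1068-fe0b-4ee3-b860-c00c9688cf49"}}'
```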
## Monitoring / Stats - Alerting
Alerting probes deployed on compute nodes
```bash
[root@host772847.preprod.gra1.cloud.ovh.net ~] instance-routed --debug
[ DEBUG ] Scanning (icmp) following 4 ips:
[ DEBUG ] 10.92.135.33 10.92.135.77 10.92.135.78 10.92.135.147
[ DEBUG ] 1 ips did not reply:
[ DEBUG ] 10.92.135.78
[ DEBUG ] 10.92.135.78 does not have static route in ns-link!!
[ DEBUG ] Trying again MTR check:
[ DEBUG ] 10.92.135.78 does not have static route in ns-link!!
[ INFO ] 1 ips are not correcly routed:
[ DEBUG ] 10.92.135.78
[ INFO ] Set oco status: 210
[ DEBUG ] Write oco status: 210
```
- Detect possible post-live-migration issues
  - cleanup on the source node (see the sketch below)
  - is everything fine on the destination?
- Alerting handled by Shinken
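As an illustration, the source-node check boils down to something like this (the path and uuid are examples; the real probes do more):
```bash
# On the source node, once the migration has completed:
# the instance directory should be gone after Nova's cleanup.
UUID=9c2d1068-fe0b-4ee3-b860-c00c9688cf49
if [ -d "/var/lib/nova/instances/${UUID}" ]; then
    echo "leftover disk files on the source node"
fi
```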
## Monitoring / Stats - Alerting
- Neutron
  - missing / orphaned BGP announcement
  - missing / orphaned OpenFlow rules (private network)
  - port / OVS bridge / namespace created
  - arping ok (instance has security rules)
## Monitoring / Stats - Alerting
- Nova
  - disk not cleaned up on the source node
- Mechanism to autoclean orphaned disks
  - at first, disabled in our production
  - we enabled it, because it just works!
```bash
# nova.conf
running_deleted_instance_action=reap
running_deleted_instance_poll_interval=7200
running_deleted_instance_timeout=3600
```
## Monitoring / Stats
- Collect data and push it to a timeseries backend
  - Python script using the libvirt API on compute nodes
    - virsh domjobinfo after a successful migration (see the sketch below)
  - Noderig (https://github.com/ovh/noderig)
  - Beamium (https://github.com/ovh/beamium)
  - Warp10 timeseries backend
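A rough sketch of the collection step (metric names are made up, and the push through Noderig / Beamium is left out):
```bash
# On the compute node, right after a migration completes:
# extract a few counters from virsh and print them as metrics.
UUID=9c2d1068-fe0b-4ee3-b860-c00c9688cf49
virsh domjobinfo "${UUID}" --completed | awk '
    /Time elapsed/   { print "livemigration.time_elapsed_ms " $3 }
    /Data processed/ { print "livemigration.data_processed "  $3, $4 }
    /Iteration/      { print "livemigration.iterations "      $2 }'
```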
## Monitoring / Stats - Grafana example


## Monitoring / Stats - Grafana example

## Juno to Newton
#### A journey with OpenStack upgrades
## The easy part - computes
The same libvirt / qemu versions on Trusty and Xenial allowed for a smooth upgrade of the compute nodes
```bash
# OpenStack Juno on Trusty
$ apt-cache policy libvirt-bin
libvirt-bin:
  Installed: 1.3.3-2ubuntu2+ovh1~trusty
# OpenStack Newton on Trusty
$ apt-cache policy libvirt-bin
libvirt-bin:
  Installed: 1.3.3-2ubuntu2+ovh1~trusty
```
So, the OpenStack upgrade is as easy as
```bash
apt-get upgrade # even easier when it's done by puppet
```
## The easy part - controllers
Almost all of the control plane upgrade is easy: we just need to reinstall those servers to move to Xenial with OpenStack Newton
## except ...
## The hard part - databases
As expected by anybody who has already upgraded OpenStack once...
Our solution
* Fast Forward Upgrade
* Docker containers to perform the alembic migrations (see the sketch below)
* Good documentation
* A working backup
* A way to roll back
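A sketch of the fast forward idea for a single service (the image name and paths are hypothetical; the real workflow covers every project and every intermediate release):
```bash
# Run each intermediate release's nova-manage against the same database,
# one release after another, so the schema migrations are applied in order.
for release in kilo liberty mitaka newton; do
    docker run --rm \
        -v /etc/nova/nova.conf:/etc/nova/nova.conf:ro \
        internal-registry/nova-tools:${release} \
        nova-manage db sync
done
```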
## Conclusion
* Backporting code to EOL OpenStack releases
* Automation is what you need
* Juno to Newton is easier when core packages are shared