Hi!

First, you might be interested in our Roadmap. This page is about the deep technical and architectural challenges we face.

Sharing challenges feels like the right thing to do as I get so much from the open-source community. If this can help people to better understand what we are building here, I'm glad to share it.

NOTE: The text below was written with voice recognition software. It might look funny, and it has not been edited by a human.


Table of Contents



Backlog

🙊 Container as an external hard drive

User stories / specs

As a DevOps hero:

  • I'm looking for an NFS/ZFS/GlusterFS (or whatever) application that mounts a common directory across all my nodes.
  • This needs to run as docker service create --mode global XYZ with Docker Swarm. No manual config on each node and no hard-coded IP to set up.
  • I want to add a new node to my existing cluster. The data should sync automatically.
  • I want a common directory (not a Docker volume) that all nodes can share. Something like /mnt/shared/permdata/

For example, I would use it this way:

  • /mnt/shared/permdata/app1/
  • /mnt/shared/permdata/app2/
  • /mnt/shared/permdata/bkp/
  • /mnt/shared/permdata/etc/

Workaround

At the moment I use Resilio, which is great. The thing I don't like is that it uses the public network to sync. There is no need for this. I want my service to use only the Swarm network of my choice.

Maybe I could force Resilio to sync only within an overlay network?
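
One way to start exploring that question is a global service attached to a dedicated overlay network. This is a minimal sketch under assumptions: the network name sync-net, the service name shared-sync and the image example/sync-image are hypothetical placeholders, and only the docker flags themselves are real. Attaching the service to an overlay network does not by itself guarantee that Resilio restricts its peer traffic to that network, so treat this as a starting point rather than an answer.

```shell
#!/bin/sh
# Sketch only: sync-net, shared-sync and example/sync-image are hypothetical names.
# Set DOCKER=echo to print the commands instead of running them.
DOCKER="${DOCKER:-docker}"

create_sync_service() {
  # Private overlay network spanning the Swarm nodes.
  "$DOCKER" network create --driver overlay sync-net

  # --mode global schedules one task on every node, so a node that joins
  # the cluster later gets its own sync task automatically.
  # The bind-mounted host directory must already exist on each node.
  "$DOCKER" service create \
    --name shared-sync \
    --mode global \
    --network sync-net \
    --mount type=bind,source=/mnt/shared/permdata,target=/data \
    example/sync-image
}
```

Running it with DOCKER=echo first is a cheap way to review the exact commands before touching the cluster.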

by: Pascal Andy / 2019-02-26

🙊 Cluster crash mitigation

EDIT: 2019-02-26_15h05: The scenario below is well managed. It's not in prod yet only because we don't have a lot of nodes at the moment. It would be too costly for now. But everything is in place to make it work very quickly.

Scenario:

This is a big one. Let's say a whole cluster is not available for 6 hours. Whatever the reason. Shall we, as a business, cry on Twitter that our server vendor is down? Absolutely not! Remember the S3 crash in February 2017? Shit happens and I don't want this to happen to us at FirePress.

The idea here is that we would have two independent clusters running in two zones (data centres).

  • 50% of our clients are in NYC
  • 50% of our clients are in AMS

Let's say NYC crashes. Fuck. OK no panic.

Deploy 100% of our clients to AMS.

The challenge is to do this very quickly. Database merging + picture merging.

Then, when things are back to normal, redistribute 50%/50%.
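
The routing half of this plan (which cluster serves which site) can be sketched as a small shell function. This is an illustration under assumptions, not how FirePress does it: the function name assign_cluster and the site ID blog-001 are hypothetical, and the split is only roughly 50/50 because it relies on a checksum of the site ID. It does not address the hard part, which is merging the databases and pictures.

```shell
#!/bin/sh
# Pick a cluster for a site among the clusters that are currently healthy.
# Usage: assign_cluster SITE_ID CLUSTER [CLUSTER...]
assign_cluster() {
  site_id=$1
  shift
  # No healthy cluster left: nothing to assign.
  [ "$#" -gt 0 ] || return 1

  # cksum prints "<checksum> <byte count>"; the checksum is stable for a given ID.
  hash=$(printf '%s' "$site_id" | cksum | cut -d ' ' -f 1)

  # Map the checksum onto one of the remaining clusters (1-based position).
  idx=$(( hash % $# + 1 ))
  shift $(( idx - 1 ))
  printf '%s\n' "$1"
}

# Normal operation:  assign_cluster blog-001 NYC AMS
# NYC is down:       assign_cluster blog-001 AMS      (every site lands on AMS)
```

Because the mapping is deterministic, passing both clusters again after the incident sends each site back to the cluster it was on before.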

This setup also allows an easy transition from one cluster to a new one. I love it. Don't patch. Scrap and start from scratch.

🙊 Roadmap

https://trello.com/b/0fCwwzqc/firepress-roadmap



Dropped

Caching website / blogs

  • Challenge: add a Varnish caching container for each blog (or maybe one for every domain we host?)
  • CMO (current mode of operation): a request goes Traefik CTN > Ghost CTN > MySQL CTN
  • FMO (future mode of operation): I want Traefik CTN > Varnish cache > (if the content is not cached...) > Ghost CTN > MySQL CTN

Minio storage for our private Docker registry

  • All nodes in the cluster shall have access to Minio bucket
  • Would be nice to use Backblaze B2 as storage provider - wip
  • To consider | https://github.com/cloudflavor/miniovol
  • Storage pricing is key. No AWS S3.
  • Backblaze is the best deal at the moment. I use them for our backups.
  • maybe REX-Ray
  • To test | https://twitter.com/askpascalandy/status/862271673072058368
  • https://github.com/codedellemc/labs
  • maybe Portworx and Minio together
  • https://www.youtube.com/watch?v=5gRQN9WxsIk

Deploy a HA MySQL database

  • 2019-02-26: See https://gist.github.com/pascalandy/5735bcae8257e861f29e06da46754aef
  • I use Percona and I should be able to do HA. I don't know how yet.
  • Galera Cluster looks promising
  • MySQL 8 will support HA natively
  • At the moment, I run one instance of Percona (no HA), with Resilio syncing a common directory between 3 nodes.
  • Still trying to find a solution to easily run a master-master-master MySQL cluster
  • To consider | https://github.com/pingcap/tidb
  • This setup looks promising but it's not quite perfect yet.
  • http://severalnines.com/en/mysql-docker-deploy-homogeneous-galera-cluster-etcd
  • https://github.com/pingcap/docs/blob/master/op-guide/docker-deployment.md

Monitoring our DB | PMM

ChatOps

DROPPED. Ops will use a terminal. That's it.

It would be nice to use Slack as a terminal. Why is that?? Here is my use case.

I want to let non-technical folks (the operations team) run Docker stacks without having to set up their user/pass/local environment and all the pain that comes with welcoming a new user into your DevOps stuff. I assume I could also prevent some actions, like rm *.



Up and running

To see how we roll (technically speaking) at FirePress, please check the post What kind of Back-End drives FirePress.

In short, we have hosting challenges. Think static websites and blog/CMS (Ghost) sites. This site is actually running within a container at http://firepress.org/en/. The home page is running in another one at http://firepress.org/.

  • ✅ Our stack is cloud agnostic. No AWS/Azure/Google lock-in.
  • ✅ We use Ubuntu servers and deploy them via the Docker Machine CLI
  • ✅ We configure our servers via a bash script / docker-machine. No need for Terraform at the moment, but we probably will some day.
  • ✅ We set UFW rules to work alongside Docker
  • ✅ We run services with docker service create (well, 95% of the time).
  • ✅ We use a Resilio service to share a common folder between all nodes. Looking to switch... see the Backlog.
  • ✅ Reverse proxy to redirect public traffic
  • ✅ Docker labels, with services deployed against those constraints

✅ Fancy bash script to launch services like:

  • Traefik
  • Percona (MySQL)
  • Ghost
  • Nginx
  • Portainer
  • Sematext
  • rClone

Most containers are built on Alpine.

  • ✅ We deploy each website via a unique ID
  • ✅ We generate dynamic landing pages via a script from an HTML template. Nothing fancy yet, but great at this stage.

✅ Our backup processes are solid.

  • Via cron
  • Interval: every 4 hours, every day
  • Compressed and encrypted before leaving the cluster for Backblaze B2
  • We are notified in Slack when the backup is done
  • We keep only the last 2 backups on the DB node
  • Swarm (Raft) is also backed up
  • ✅ Cron runs docker system prune --all --force on each node
  • ✅ Cron backs up the Swarm Raft
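
The schedule and the "keep only the last 2" rule above can be sketched as follows. This is a minimal sketch under assumptions: the script path /usr/local/bin/backup.sh, the function name prune_backups and the file naming scheme are hypothetical, and it assumes backup file names embed a sortable timestamp.

```shell
#!/bin/sh
# Hypothetical crontab entry: run the backup script every 4 hours, every day.
#   0 */4 * * * /usr/local/bin/backup.sh

# Keep only the newest N backups in a directory (default: 2).
# Assumes names sort chronologically, e.g. db-2019-02-26_04h00.sql.gz.gpg
# Usage: prune_backups DIR [KEEP]
prune_backups() {
  dir=$1
  keep=${2:-2}
  # Sort names newest first, skip the first $keep, delete the rest.
  ls -1 "$dir" | sort -r | tail -n +"$((keep + 1))" | while IFS= read -r name; do
    rm -f -- "$dir/$name"
  done
}
```

Sorting by name rather than by modification time keeps the result predictable even if files are copied or touched after they were created.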

✅ Docker build

  • Highly standardized for all containers
  • Tags (edge, stable, version) are applied automatically. We build our containers simply by running ./builder.sh + directory name
  • Versioning is A1. We use tags: edge and stable
  • ✅ We deploy our web apps with a PathPrefix (Traefik)
  • mycie.com/green/
  • mycie.com/blue/
  • mycie.com/yellow/
  • We use the Cloudflare CLI (create, update, delete | zone, A, CNAME, etc.) via flarectl, which runs in a short-lived container
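
The PathPrefix deployments listed above (mycie.com/green/, mycie.com/blue/ and so on) could look roughly like this. It is a sketch under assumptions, not the actual FirePress script: it assumes Traefik 1.x label syntax, a hypothetical overlay network named proxy-net, a hypothetical function name deploy_site, and the public ghost:alpine image. Port 2368 is Ghost's default.

```shell
#!/bin/sh
# Sketch only: assumes Traefik 1.x labels and a hypothetical network "proxy-net".
# Set DOCKER=echo to print the command instead of running it.
DOCKER="${DOCKER:-docker}"

# Usage: deploy_site SITE_ID DOMAIN   (e.g. deploy_site green mycie.com)
deploy_site() {
  site_id=$1
  domain=$2
  # Ghost needs to know its public URL when it is served under a sub-path.
  "$DOCKER" service create \
    --name "ghost-$site_id" \
    --network proxy-net \
    --env "url=http://$domain/$site_id/" \
    --label traefik.enable=true \
    --label traefik.port=2368 \
    --label "traefik.frontend.rule=Host:$domain;PathPrefix:/$site_id/" \
    ghost:alpine
}
```

Keying everything on the site ID means the service name, the URL and the routing rule cannot drift apart.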

✅ We contribute to making Docker a better place

Monitoring stack Swarmprom / Portainer

  • Metrics | Collects, processes, and publishes metrics
  • Intel Snap | Collects, processes, and publishes metrics
  • InfluxDB | Stores metrics
  • Grafana | Displays metrics visually
  • Logs ELK (ElasticSearch, Logstash, Kibana)
  • Alerts management (i.e. one node is not responsive)
  • Monitoring Percona MySQL DB performance (in Docker of course)

Traefik config

  • Traefik is a beast. So many configs!
  • Traefik allows me to automatically create HTTPS certificates for each site. But I can't make it work alongside the Cloudflare service. It's one or the other. I'm screwed so I don't use SSL at the moment.
  • Test ACME renewal

DNS load balance BEFORE hitting the swarm cluster

  • Challenge: at the moment, Cloudflare points to ONE node. If this node crashes, all our sites go down!
  • Cloudflare is working on its load balancing solution, but let's be proactive. See this ticket.
  • We need a health check to see if our 3 managers are healthy and do a round-robin with sticky sessions between them. If one manager is not healthy, the round-robin system shall stop sending traffic to this node. If node Leader 1 is down, the system shall point traffic to node Leader 2 or 3 (health check).
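
The health-checked round robin described above can be sketched in a few lines of shell. This is an illustration under assumptions: the function names, the /ping path and port 8080 are hypothetical, and sticky sessions are not covered. It only shows the "skip the unhealthy manager" logic.

```shell
#!/bin/sh
# Health probe. The /ping path and port 8080 are assumptions; point this at
# whatever health endpoint your reverse proxy actually exposes.
is_healthy() {
  curl -fsS --max-time 2 "http://$1:8080/ping" >/dev/null 2>&1
}

# Print the next healthy node, starting from a round-robin counter.
# Usage: next_healthy_node START NODE [NODE...]
next_healthy_node() {
  start=$1
  shift
  count=$#
  i=0
  while [ "$i" -lt "$count" ]; do
    # 1-based position of the candidate, wrapping around the list.
    idx=$(( (start + i) % count + 1 ))
    eval "node=\${$idx}"
    if is_healthy "$node"; then
      printf '%s\n' "$node"
      return 0
    fi
    i=$((i + 1))
  done
  # Every node failed its health check.
  return 1
}
```

The caller increments START after each request. A node that fails its probe is simply skipped until it answers again.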

Zero-downtime deployments with rolling upgrades

Find the best practice to update each node

  • At the moment the Docker daemon needs to restart... and the DB goes down for 1-2 minutes
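
One common pattern for node maintenance is to drain a node before touching its daemon, then put it back in rotation. The sketch below uses real docker flags, but the function names are hypothetical and it is not a confirmed fix for the DB downtime: a single-replica database still restarts when its task moves to another node.

```shell
#!/bin/sh
# Set DOCKER=echo to print the commands instead of running them.
DOCKER="${DOCKER:-docker}"

# Move all tasks off NODE before upgrading its Docker daemon.
drain_node() { "$DOCKER" node update --availability drain "$1"; }

# Put NODE back in rotation once the upgrade is done.
activate_node() { "$DOCKER" node update --availability active "$1"; }

# Roll a service to a new image one task at a time, pausing between tasks.
# Usage: rolling_update SERVICE IMAGE
rolling_update() {
  service=$1
  image=$2
  "$DOCKER" service update \
    --update-parallelism 1 \
    --update-delay 10s \
    --image "$image" \
    "$service"
}
```

The sequence would be drain_node, upgrade the daemon on that node, then activate_node, one node at a time.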

Redirect the path domain.com/web to domain.com/web/ (with the trailing slash)

Deploying servers at scale

  • Build a Packer / Terraform routine to deploy new nodes (see also SCW Builder)
  • Minimize manual processes (running bash scripts) to set up Docker Swarm join / Gluster and UFW rules for private networks
  • Better use of docker-machine so I can use eval more efficiently instead of switching between terminal windows.

CI/CD

  • Of course one day it will make sense to get there
  • I don't feel the need for this at the moment, the Docker workflow by itself is solid enough
  • Would be great to rebuild images every night



Get involved

If you have solid skills 🤓 with Docker Swarm, Linux bash and the gang* and you would love to help a startup launch 🔥 a solid project, I would love to get to know you 🍻. Buzz me 👋 on Twitter @askpascalandy. You can see the things that are done and the things we have to do here.

I'm looking for bright and caring people to join this journey with me.

We are hosting between 30 and 60 websites/services at any given moment. Not so many at this point, as we are in the Beta phase. I'm looking to define an official SLA for our stack.

In short, we have hosting challenges. Think static websites and blog/CMS (Ghost) sites. This site is actually running within a container at firepress.org/en/.

Thanks in advance!
Pascal

