Technical challenges
Hi!
First, you might be interested in our Roadmap. This page is about the deep technical & architectural challenges we have.
Sharing challenges feels like the right thing to do, as I get so much from the open-source community. If this helps people better understand what we are building here, I'm glad to share it.
NOTE: The text below was written with voice recognition software. It might look funny and is not edited by a human.
Table of Contents
Backlog
Container as an external hard drive
User stories / specs
As a DevOps hero:
- As a DevOps hero, I'm looking for an NFS/ZFS/GlusterFS (or similar) application that mounts a common directory between all my nodes.
- This needs to run as a docker service create XYZ --global with Docker Swarm. No manual configs on each node and no hard-coded IPs to set up (a rough sketch of the desired command follows the example paths below).
- As a DevOps hero, I want to create a new node on my existing cluster. The data should sync automatically.
- As a DevOps hero, I want to have a common directory (not a Docker volume) that all nodes can share. Something like /mnt/shared/permdata/
For example, I would use it this way:
- /mnt/shared/permdata/app1/
- /mnt/shared/permdata/app2/
- /mnt/shared/permdata/bkp/
- /mnt/shared/permdata/etc/
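Just to make the user story concrete, here is roughly what I'd love the one-liner to look like. The acme/shared-fs image is made up; only the Docker CLI flags are real, and the overlay network would have to exist already:

```bash
# Hypothetical image; the point is the UX: one global service, zero per-node config.
docker service create \
  --name shared-fs \
  --mode global \
  --network my_overlay \
  --mount type=bind,source=/mnt/shared/permdata,target=/permdata \
  acme/shared-fs
```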
Workaround
At the moment I use Resilio, which is great. The thing I don't like is the fact that it uses the public network to sync. There is no need for this. I want my service to use only the swarm network of my choice.
Maybe I could force Resilio to sync only within an overlay network? Something like the sketch below.
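A rough, untested sketch of that idea, using the resilio/sync image from Docker Hub (network name and paths are placeholders):

```bash
# Attachable overlay network dedicated to sync traffic.
docker network create --driver overlay --attachable sync_net

docker service create \
  --name resilio-sync \
  --mode global \
  --network sync_net \
  --mount type=bind,source=/mnt/shared/permdata,target=/mnt/sync/folders/permdata \
  resilio/sync
# The target path inside the container may differ; check the resilio/sync image docs.
```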
by: Pascal Andy / 2019-02-26
Cluster crash mitigation
EDIT: 2019-02-26_15h05: The scenario below is well managed. It's not in prod yet only because we don't have a lot of nodes at the moment. It would be too costly for now. But everything is in place to make it work very quickly.
Scenario:
This is a big one. Let's say a whole cluster is not available for 6 hours, whatever the reason. Shall we, as a business, cry on Twitter that our server vendor is down? Absolutely not! Remember the S3 crash in early 2017? Shit happens and I don't want this to happen to us at FirePress.
The idea here is that we would have two independent clusters running in two zones (data centre).
- 50% of our clients are in NYC
- 50% of our clients are in AMS
Let's say NYC crashes. Fuck. OK, no panic.
Deploy 100% of our clients to AMS.
The challenge is to do this very quickly: database merging + picture merging (rough sketch below).
Then, when things are back to normal, redistribute 50%/50%.
This setup also allows an easy transition from one cluster to a new one. I love it. Don't patch. Scrap and start from scratch.
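To make the "deploy 100% of our clients to AMS" step concrete, a very rough sketch, assuming one deploy script per client site and backups already restored on the AMS cluster (clients-nyc.txt and deploy-site.sh are made-up names):

```bash
# Point the local Docker CLI at the surviving AMS cluster.
eval "$(docker-machine env ams-manager-1)"

# Re-create the Ghost + MySQL services for every site that was living in NYC.
while read -r site; do
  ./deploy-site.sh "$site"
done < clients-nyc.txt
```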
Roadmap
https://trello.com/b/0fCwwzqc/firepress-roadmap
Go back to Table of Contents
Dropped
Caching websites / blogs
- Challenge → Add a Varnish caching container for each blog (or maybe one for every domain we host?)
- CMO (current mode of operation): a request goes Traefik CTN > Ghost CTN > MySQL CTN
- FMO (future mode of operation): I want Traefik CTN > Varnish cache CTN > (if content is not cached...) > Ghost CTN > MySQL CTN (rough sketch below)
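If I ever pick this up again, the FMO chain could look something like this. It assumes a Ghost service reachable as "ghost" on port 2368 and the official varnish image; the VCL is the bare minimum, with no cache tuning:

```bash
# Minimal VCL: just point Varnish at the Ghost service on the overlay network.
cat > default.vcl <<'EOF'
vcl 4.0;
backend ghost {
  .host = "ghost";     # swarm service name of the Ghost container
  .port = "2368";
}
EOF

# Bind mount means default.vcl must exist on whichever node runs the task.
docker service create \
  --name varnish-cache \
  --network my_overlay \
  --mount type=bind,source=$(pwd)/default.vcl,target=/etc/varnish/default.vcl \
  varnish
```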
Minio storage for our private Docker registry
- All nodes in the cluster shall have access to the Minio bucket (rough sketch after this list)
- Would be nice to use Backblaze B2 as the storage provider - wip
- To consider | https://github.com/cloudflavor/miniovol
- Storage pricing is key. No AWS S3.
- Backblaze is the best deal at the moment. I use them for our backups.
- maybe REX-Ray
- To test | https://twitter.com/askpascalandy/status/862271673072058368
- https://github.com/codedellemc/labs
- maybe Portworx and Minio together
- https://www.youtube.com/watch?v=5gRQN9WxsIk
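One way I could wire this up: run Minio as the S3-compatible backend and point the official registry:2 image at it via its S3 storage driver. Everything below (names, keys, network, bucket) is a placeholder; only the images and the REGISTRY_STORAGE_S3_* variables are standard:

```bash
# Minio as the S3-compatible object store (bucket must be created separately).
docker service create --name minio \
  --network registry_net \
  -e MINIO_ACCESS_KEY=minio-key \
  -e MINIO_SECRET_KEY=minio-secret \
  --mount type=bind,source=/mnt/shared/permdata/minio,target=/data \
  minio/minio server /data

# Private registry using the S3 driver, pointed at Minio instead of AWS.
docker service create --name registry \
  --network registry_net \
  --publish 5000:5000 \
  -e REGISTRY_STORAGE=s3 \
  -e REGISTRY_STORAGE_S3_REGIONENDPOINT=http://minio:9000 \
  -e REGISTRY_STORAGE_S3_BUCKET=docker-registry \
  -e REGISTRY_STORAGE_S3_ACCESSKEY=minio-key \
  -e REGISTRY_STORAGE_S3_SECRETKEY=minio-secret \
  -e REGISTRY_STORAGE_S3_REGION=us-east-1 \
  registry:2
```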
Deploy a HA MySQL database
- 2019-02-26: See https://gist.github.com/pascalandy/5735bcae8257e861f29e06da46754aef
- I use Percona and I should be able to do HA. I don't know how yet
- Galera Cluster looks promising
- Mysql 8 will support HA natively
- At the moment, I run one instance of Percona (no HA). Resilio syncs a common directory between 3 nodes (the current setup is sketched after this list).
- Still trying to find a solution to easily run a MySQL master-master-master cluster
- To consider | https://github.com/pingcap/tidb
- This setup looks promising but it's not quite perfect yet.
- http://severalnines.com/en/mysql-docker-deploy-homogeneous-galera-cluster-etcd
- https://github.com/pingcap/docs/blob/master/op-guide/docker-deployment.md
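For reference, the current non-HA setup described above boils down to something like this (password, network and paths are placeholders; the data dir sits on the Resilio-synced path):

```bash
# Single Percona instance pinned to one node; no HA, just a synced data directory.
docker service create \
  --name mysql \
  --network db_net \
  --replicas 1 \
  --constraint 'node.hostname == node1' \
  -e MYSQL_ROOT_PASSWORD=change-me \
  --mount type=bind,source=/mnt/shared/permdata/mysql,target=/var/lib/mysql \
  percona:5.7
```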
Monitoring our DB | PMM
ChatOps
DROPPED. Ops will use a terminal. That's it.
It would be nice to use Slack as a terminal. Why is that? Here is my use case.
I want to let non-technical folks (the operations team) run Docker stacks without having to set up their user/pass/local environment and all the pain that comes with welcoming a new user into your DevOps stuff. I assume I could prevent them from doing some actions as well, like rm *
Go back to Table of Contents
Up and running
To see how we roll (technically speaking) at FirePress, please check the post What kind of Back-End drives FirePress.
In short, we have hosting challenges. Think static websites and blog/CMS (Ghost) sites. This site is actually running within a container at http://firepress.org/en/. The home page is running in another one at http://firepress.org/.
- ✔ Our stack is cloud agnostic. No AWS/Azure/Google lock-in.
- ✔ We use Ubuntu servers and deploy them via the Docker Machine CLI
- ✔ We configure our servers via a bash script / docker-machine. No need for Terraform at the moment, but we probably will need it some day.
- ✔ We set UFW rules to work alongside Docker (see the sketch after this group of items)
- ✔ We run services via docker service create (well, 95% of the time).
- ✔ We use the Resilio service to share a common folder between all nodes. Looking to switch… see below.
- ✔ Reverse proxy to redirect public traffic
- ✔ We label Docker nodes and deploy services against those constraints
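A rough sketch of the node bring-up + UFW part (not our exact script; provider, token and node names are examples; the ports are the standard Swarm ones: 2377/tcp management, 7946/tcp+udp gossip, 4789/udp overlay):

```bash
# Create the node with docker-machine (any driver would do).
docker-machine create --driver digitalocean \
  --digitalocean-access-token "$DO_TOKEN" \
  --digitalocean-image ubuntu-18-04-x64 \
  node4

# Open only SSH, HTTP/HTTPS and the Swarm ports, then enable UFW.
docker-machine ssh node4 "\
  ufw allow 22/tcp && ufw allow 80/tcp && ufw allow 443/tcp && \
  ufw allow 2377/tcp && ufw allow 7946/tcp && ufw allow 7946/udp && \
  ufw allow 4789/udp && \
  ufw --force enable"
```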
✔ Fancy bash scripts to launch services like:
- Traefik
- Percona (MySQL)
- Ghost
- Nginx
- Portainer
- Sematext
- rClone
Most containers are built on Alpine.
- ✔ We deploy each website via a unique ID
- ✔ We generate dynamic landing pages via a script from an HTML template. Nothing fancy yet, but great at this stage.
✔ Our backup processes are solid (rough sketch after this list):
- Via cron
- Interval: every 4 hours, every day
- Compressed and encrypted before going outside the cluster to Backblaze B2
- Notified in Slack when the backup is done
- Keeping only the last 2 backups on the DB node
- Swarm (raft) is also backed up
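A rough sketch of that backup flow (not our actual script): cron runs something like this every 4 hours. Paths, remote names, the passphrase file and the webhook URL are all placeholders:

```bash
STAMP=$(date +%Y-%m-%d_%Hh%M)

# Compress, then encrypt before anything leaves the cluster
# (exact gpg flags depend on the gpg version installed).
tar -czf "/backup/db_${STAMP}.tar.gz" /mnt/shared/permdata/app1
gpg --batch --symmetric --passphrase-file /root/.backup_pass \
  "/backup/db_${STAMP}.tar.gz"

# Ship the encrypted archive to Backblaze B2 via rclone.
rclone copy "/backup/db_${STAMP}.tar.gz.gpg" b2-remote:firepress-backups

# Ping Slack when done.
curl -s -X POST -H 'Content-type: application/json' \
  --data "{\"text\":\"Backup ${STAMP} done\"}" "$SLACK_WEBHOOK_URL"

# Keep only the two most recent local archives on the DB node.
ls -1t /backup/*.tar.gz.gpg | tail -n +3 | xargs -r rm -f
```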
- ✔ Cron runs docker system prune --all --force on each node
- ✔ Cron backs up the Swarm Raft
✔ Docker builds
- Highly standardized for all containers
- Tags (edge, stable, version) are applied automatically. We build our images simply by running ./builder.sh + directory name (rough sketch below)
- Versioning is A1. We use tags: edge and stable
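The builder.sh script itself isn't shown here, so this only illustrates the tagging convention (the firepress/ namespace and the VERSION file are just for illustration):

```bash
#!/usr/bin/env bash
# Usage: ./builder.sh ghost   -> builds ./ghost and tags edge, stable and the version.
APP_DIR="$1"
VERSION=$(cat "${APP_DIR}/VERSION")        # hypothetical version file per image dir

docker build -t "firepress/${APP_DIR}:edge" "${APP_DIR}"
docker tag "firepress/${APP_DIR}:edge" "firepress/${APP_DIR}:${VERSION}"
docker tag "firepress/${APP_DIR}:edge" "firepress/${APP_DIR}:stable"

docker push "firepress/${APP_DIR}:edge"
docker push "firepress/${APP_DIR}:${VERSION}"
docker push "firepress/${APP_DIR}:stable"
```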
- ✔ We deploy our web apps with a PathPrefix (Traefik); see the sketch after these bullets
- mycie.com/green/
- mycie.com/blue/
- mycie.com/yellow/
- We use the Cloudflare CLI to create, update, and delete zones, A and CNAME records, etc. via flarectn, which runs within a sporadic container
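For the PathPrefix bullets above, here is what one such deployment looks like with Traefik 1.x service labels (service, network and domain names are placeholders):

```bash
# Ghost service exposed at mycie.com/green/ behind Traefik 1.x.
docker service create \
  --name ghost-green \
  --network traefik_net \
  --label traefik.enable=true \
  --label traefik.port=2368 \
  --label traefik.docker.network=traefik_net \
  --label 'traefik.frontend.rule=Host:mycie.com;PathPrefix:/green' \
  ghost:2-alpine
```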
✔ We contribute to making Docker a better place
- Feature Request: Show --global instance numbers when docker service
- Fixed → https://github.com/moby/moby/issues/27670
- Scheduler limits the # of containers to 40 per worker node (the overlay network limit is 252 containers) | Swarm 1.12.1
- Fixed → https://github.com/moby/moby/issues/26702
Monitoring stack: Swarmprom / Portainer (quick sketch after this list)
- Metrics | Collects, processes, and publishes metrics
- Intel Snap | Collects, processes, and publishes metrics
- InfluxDB | Stores metrics
- Grafana | Displays metrics visually
- Logs | ELK (Elasticsearch, Logstash, Kibana)
- Alerts management (i.e. one node is not responsive)
- Monitoring Percona MySQL DB performance (in Docker, of course)
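For the Swarmprom side, the upstream repo (stefanprodan/swarmprom) is basically a stack file, so trying it out should be as simple as something like this (credentials are placeholders):

```bash
git clone https://github.com/stefanprodan/swarmprom.git
cd swarmprom

# Deploy the monitoring stack (Prometheus, Grafana, cAdvisor, node-exporter) to the swarm.
ADMIN_USER=admin ADMIN_PASSWORD=change-me \
  docker stack deploy -c docker-compose.yml mon
```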
Traefik config
- Traefik is a beast. So many configs!
- Traefik allows me to automatically enable HTTPS for each site. But I can't make it work alongside Cloudflare's service. It's one or the other. I'm screwed, so I don't use SSL at the moment (see the DNS-challenge sketch below).
- Test ACME renewal
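One possible way around the Cloudflare conflict: use Traefik 1.x's DNS challenge with the Cloudflare provider, so certificates get issued even while Cloudflare proxies the traffic. This is only a sketch I haven't validated in our stack; email, key, mounts and network are placeholders:

```bash
# acme.json must already exist on the manager node where this task is constrained.
docker service create \
  --name traefik \
  --network traefik_net \
  --publish 80:80 --publish 443:443 \
  --constraint 'node.role == manager' \
  --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
  --mount type=bind,source=/mnt/shared/permdata/traefik/acme.json,target=/acme.json \
  -e CLOUDFLARE_EMAIL=ops@example.com \
  -e CLOUDFLARE_API_KEY=change-me \
  traefik:1.7 \
  --docker --docker.swarmMode --docker.watch \
  --defaultEntryPoints=http,https \
  --entryPoints='Name:http Address::80' \
  --entryPoints='Name:https Address::443 TLS' \
  --acme --acme.email=ops@example.com --acme.storage=/acme.json \
  --acme.entryPoint=https --acme.onHostRule=true \
  --acme.dnsChallenge.provider=cloudflare
```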
DNS load balance BEFORE hitting the swarm cluster
- Challenge → At the moment, Cloudflare points to ONE node. If this node crashes, all our sites go down!
- Cloudflare is working on their load-balancing solution, but let's be proactive. See this ticket.
- We need a health check to see if our 3 managers are healthy and do round-robin sticky sessions between them. If one manager is not healthy, the round-robin system shall stop sending traffic to that node. If node Leader 1 is down, the system shall point traffic to node Leader 2 or 3 (health check). A toy sketch follows.
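A toy version of that health check could be as dumb as this (the /healthcheck endpoint and the IPs are made up; the actual DNS update via flarectn is left as a stub):

```bash
# Poll each manager; keep healthy ones in DNS, drop the rest.
MANAGERS="203.0.113.10 203.0.113.11 203.0.113.12"
for ip in $MANAGERS; do
  if curl -fsS -m 5 "http://${ip}/healthcheck" > /dev/null; then
    echo "healthy: ${ip}"     # keep this A record
  else
    echo "unhealthy: ${ip}"   # remove this A record (e.g. via flarectn)
  fi
done
```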
Zero-downtime deployments with rolling upgrades
- Will be fixed by the docker team
- https://github.com/moby/moby/issues/30321
Find the best practice to update each node
- At the moment the Docker daemon needs to restart... and the DB goes down for 1-2 minutes
Redirect path to domain.com/web'/'
- Known issue with Traefik. See https://github.com/containous/traefik/issues/1123#issue-205597693
- Should be fixed in Traefik 1.3. See PR
Deploying servers at scale
- Build a Packer / Terraform routine to deploy new nodes (see also SCW Builder)
- Minimize manual processes (of running bash scripts) to set up Docker Swarm join / Gluster, and UFW rules for private networks (see the sketch below)
- Better use of docker-machine so I can use eval more efficiently instead of switching between terminal windows.
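The Swarm-join part, at least, can already be scripted with nothing but docker-machine and the join token (node names are examples):

```bash
# Grab the worker token on a manager, then push it to the new node over SSH.
eval "$(docker-machine env manager1)"
TOKEN=$(docker swarm join-token -q worker)
MANAGER_IP=$(docker-machine ip manager1)

docker-machine ssh node4 "docker swarm join --token ${TOKEN} ${MANAGER_IP}:2377"
```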
CICD
- Of course one day it will make sense to get there
- I don't feel the need for this at the moment; the Docker workflow by itself is solid enough
- Would be great to rebuild images every night
Go back to Table of Contents
Get involved
If you have solid skills with Docker Swarm, Linux bash and the gang*, and you would love to help a startup launch a solid project, I would love to get to know you. Buzz me on Twitter @askpascalandy. You can see the things that are done and the things we have to do here.
I'm looking for bright and caring people to join this journey with me.
We are hosting between 30 and 60 websites / services at any given moment. Not that many at this point, as we are in the Beta phase. I'm looking to define an official SLA for our stack.
Thanks in advance!
Pascal
Go back to Table of Contents