Agenda • A little background about why we decided to build an internal PaaS. • Introduction to Empire. • How we’re leveraging Amazon EC2 Container Service (ECS) as the backend. • Demo • Q&A
Who am I • Eric Holmes • Infrastructure Engineer at Remind • I like building things for other developers • Work mostly with Go and Ruby • You can find my open source stuff at https://github.com/ejholmes
What’s Remind? • Remind is a messaging platform for teachers, students and parents. • Chat/Announcements/Files • ~25 million users. ~350,000 new users per day during back-to-school (BTS) • ~5 million messages per day. • ~50 employees. ~30 engineers.
Started as a monorail
We started growing...
Broke apart the monolith • Sidekiq queues were I/O-bound and constantly backed up during BTS • Message delivery workers were tightly coupled to the rest of the application, making them difficult to scale out horizontally • Database would need to be sharded • Started breaking the monolith apart into loosely coupled services. • Now have ~50 production services
Heroku • Entirely hosted on Heroku • Heroku has been awesome; never needed an ops team. • Allowed us to focus on building product.
But we ran into issues... • “Internal” micro-services need to be exposed publicly. • Databases need to be opened up to all traffic. • Little visibility into performance of hosts. • No control over the routing layer.
What do we want? • Want to use AWS services. • Want to maintain operational simplicity. • Support 12 factor apps. http://12factor.net/ • Maintain shared patterns for deployment. Faster iteration and build + release cycles • No ops. • Decrease our surface area and only expose a single app publicly. • Robust and resilient to failure. Self-healing. • If we can, continue to use containers as a unit of deployment.
Why containers? • Fast to build* • Let us isolate dependencies as a portable, easy-to-distribute package. • Allow us to create better development environments with more dev/prod parity. • Limit the number of moving parts when we deploy. • Better resource utilization and cost management
We’re not the first company to want a PaaS • Netflix - Asgard • SoundCloud - Bazooka • Every other company in our investor’s portfolio...
Something we can re-use? • Flynn – Alpha – Undergoing many architectural changes – Custom load balancer • Deis – More than it needed to be – Nobody using it successfully in production (that we knew of)
Empire was born • Initially started as a management layer on top of CoreOS + fleet. • Load balancing via nginx configured through confd + etcd. • Unit of deployment was Docker containers • Implemented a subset of the Heroku API
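A rough illustration of the "nginx configured through confd + etcd" idea: confd renders config files from Go templates whenever the keys it watches change. The sketch below uses plain text/template with hard-coded data rather than confd's own template functions or a live etcd; the Upstream type, the upstream name, and the backend addresses are all hypothetical.

```go
package main

import (
	"os"
	"strings"
	"text/template"
)

// upstreamTmpl is a hypothetical nginx upstream block. In the real setup the
// backend list would come from etcd keys watched by confd.
const upstreamTmpl = `upstream {{.Name}} {
{{- range .Backends}}
    server {{.}};
{{- end}}
}
`

// Upstream is an illustrative type, not part of Empire or confd.
type Upstream struct {
	Name     string
	Backends []string
}

// render executes the template against one upstream definition.
func render(u Upstream) (string, error) {
	t, err := template.New("upstream").Parse(upstreamTmpl)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	if err := t.Execute(&b, u); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	out, err := render(Upstream{
		Name:     "acme_web",
		Backends: []string{"10.0.0.1:8080", "10.0.0.2:8080"},
	})
	if err != nil {
		panic(err)
	}
	os.Stdout.WriteString(out)
}
```

Every container start or stop means re-rendering and reloading a config like this, which is part of why a managed load balancer was attractive later on.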
Therein lies the rub... • Fleet initially worked well, until we started testing failure modes. • Fleet had a lot of bugs • etcd was fragile • We needed resilience and stability • We didn’t want to run and operate our own clustering.
Amazon EC2 Container Service (ECS) becomes GA • Amazon ECS became GA while we were looking for an alternative scheduler. • Looked promising to serve as the scheduling backend.
What is Amazon ECS? • Pools hosts together as a single compute resource. • Provides a set of APIs for placing tasks on machines • Scheduler supports “services” for scaling tasks horizontally and maintaining desired state. • Services integrate with ELB for connection draining, zero downtime, and healthchecks.
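To make "maintaining desired state" concrete, here is a toy reconciliation step. It is only a sketch of the general idea, not how the Amazon ECS scheduler is implemented; the Task and Plan types and the reconcile function are hypothetical names invented for this example.

```go
package main

import "fmt"

// Task is a hypothetical view of one running container.
type Task struct {
	ID      string
	Healthy bool
}

// Plan describes what a scheduler would do to converge on the desired count.
type Plan struct {
	Start int      // number of new tasks to launch
	Stop  []string // IDs of tasks to stop
}

// reconcile compares the desired task count with what is actually running.
// Unhealthy tasks are stopped and replaced; surplus healthy tasks are stopped.
func reconcile(desired int, tasks []Task) Plan {
	var plan Plan
	kept := 0
	for _, t := range tasks {
		switch {
		case !t.Healthy:
			plan.Stop = append(plan.Stop, t.ID) // replace failed task
		case kept >= desired:
			plan.Stop = append(plan.Stop, t.ID) // scale in
		default:
			kept++
		}
	}
	plan.Start = desired - kept
	return plan
}

func main() {
	tasks := []Task{{ID: "a", Healthy: true}, {ID: "b", Healthy: false}}
	plan := reconcile(3, tasks)
	fmt.Printf("start %d, stop %v\n", plan.Start, plan.Stop)
}
```

Running a loop like this continuously is what makes a service self-healing: a crashed task simply shows up as a gap between desired and actual state.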
Amazon ECS for Empire • Solid set of primitives to serve as the scheduling backend • Managed service • Failure modes behaved as we expected them to • ELB integration allowed us to remove custom routing layer • Service discovery via DNS
What is Empire? • Open source internal PaaS for micro-services • A layer of usability on top of Amazon ECS for 12-factor apps • Single binary. Minimal deps. Easy to run. • Provides an API and CLI to create apps, deploy Docker images, update configuration, run one-off tasks, etc. • Allows you to use Procfiles to build multiple Amazon ECS services
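As a sketch of the Procfile idea: each "name: command" line describes one process type, and each process type can become its own service. The code below parses a Heroku-style Procfile and derives one service name per process. The parseProcfile and serviceNames helpers and the "app-process" naming scheme are assumptions made for illustration, not Empire's actual code or naming.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// parseProcfile reads "name: command" lines, skipping blanks and # comments.
func parseProcfile(src string) (map[string]string, error) {
	procs := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		name, cmd, ok := strings.Cut(line, ":")
		name, cmd = strings.TrimSpace(name), strings.TrimSpace(cmd)
		if !ok || name == "" || cmd == "" {
			return nil, fmt.Errorf("invalid Procfile line: %q", line)
		}
		procs[name] = cmd
	}
	return procs, sc.Err()
}

// serviceNames maps each process type to a hypothetical service name.
func serviceNames(app string, procs map[string]string) []string {
	names := make([]string, 0, len(procs))
	for p := range procs {
		names = append(names, app+"-"+p)
	}
	sort.Strings(names) // map iteration order is random; sort for stable output
	return names
}

func main() {
	procs, err := parseProcfile("web: ./bin/web\nworker: ./bin/worker -c 4\n")
	if err != nil {
		panic(err)
	}
	for _, n := range serviceNames("acme", procs) {
		fmt.Println(n, "=>", procs[strings.TrimPrefix(n, "acme-")])
	}
}
```

Keeping one service per process type means "web" and "worker" can be scaled independently, which matches the Heroku model the team was already used to.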
Is it ready for production? • Running ~15 production services within Amazon ECS managed via Empire for a little over a month • Empire is hands off after you’ve deployed. AWS services take over • Moving directly onto EC2 showed huge performance improvements for services
What does Empire not do? • Doesn’t provide logging or metrics; bring your own (soon?) • Doesn’t handle building your Docker images • Doesn’t handle the creation of attached resources like databases