Continuous Deployment • In 2012, I spoke many times on continuous deployment. • But changing from release cycles to continuous deployment is too big a change for most organizations, and they don't have the tools to do it.
Goal • I'm hoping that adding new metrics to the application becomes so addictive that you'll want to shorten release cycles.
What is DevOps? • Puppet, Chef, Ansible? • GitHub? AWS? The Cloud? • Continuous Deployment? Yes, but these are tools. Great tools.
It's About Communication • Between machines • Between team members • Between Dev and Ops But in many companies there is a bigger problem
You're Invisible • If you are in Business, you are invisible to Development and Tech Operations • If you are in Operations, you are invisible to Business and Development • If you are in Development, you are invisible to Business and Operations.
Invisible Things Aren't Valued
Developer • "I don't know what my code will do in production; let ops deal with it." • "Why doesn't ops fix these problems?" • "What does Ops do all day?"
Business • "Why do I have to wait till the end of the month for a report?" • "Did last week's release change anything?" • "Why don't they understand the impact of that bug, outage, etc.?"
Operations • "Why are they always bothering me?" • "I've got work to do!" • "Why do we have to do another release again... can't developers do a better job?" • "What does this company do?" (really)
This is really destructive • To you • To your team • To your company
All of This Can Be Fixed By Making Operations Visible with Data Not just technical operations but company operations.
Your company is full of data! So Why Not Expose This Data? Here's a list of excuses I've heard:
"But I already have graphing in my alerting system" • Maybe. But it's junk • Can't share • Can't do data mash-ups • Can't do data transformations
"They wouldn't understand." • "They won't understand the data so what's the point of sharing it." • First, "they" probably do. And the more people looking at ops metrics, the better. • Us vs. Them = Fail.
"They might break something." • "The data is in our alerting system, we don't want you to break it." • Assumes "they" are incompetent or malicious. Learn to trust.
"It's not your job, so you don't need to know." "That information isn't important" • This excuse is typically caused by fear. • Why are you deciding what's important?
"I'm not making another system, duplicating data is bad." • For operational metrics it is perfectly OK to have a redundant copy of data. • Completely different goals. • Use as alerting-beta
"I'm too busy." "It's too dangerous" "I don't know how." • These are real problems. • So let's fix it!
One Machine, One Day, One Person Challenge! Let's get 100% of operational metrics in, and enable the application to make and share new metrics on demand without any help from you.
Graphite isn't Perfect • Documentation isn't great (but getting better) • A few QA issues • Somewhat odd stack (python-twisted, django)
Graphite Ecosystem • Flexible input and output • REST API for graphs • Simple UI for mashups and dashboards • 3rd party, custom, client-side dashboards
Makes Sharing Easy • Do you have an interesting graph? It's just a URL! • Dashboards are easy since graphs are just URLs. Very easy to make HTML dashboards.
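To make "it's just a URL" concrete, here is a minimal sketch that builds Graphite render URLs and wraps them in a bare HTML dashboard. The host `graphite.example.com` and the metric path in the usage note are placeholders, not names from this talk; `/render` with `target`, `from`, `width` and `height` is Graphite's render API.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical Graphite host; replace with your own server.
GRAPHITE_BASE = "http://graphite.example.com"


def graph_url(targets, time_from="-24h", width=600, height=300):
    """Build a Graphite render URL for one or more metric targets."""
    params = [("target", t) for t in targets]
    params += [("from", time_from), ("width", width), ("height", height)]
    return GRAPHITE_BASE + "/render?" + urlencode(params)


def dashboard_html(urls):
    """A dashboard is just a page of <img> tags pointing at graph URLs."""
    imgs = "\n".join(f'<img src="{escape(u)}">' for u in urls)
    return f"<html><body>\n{imgs}\n</body></html>"
```

Calling `dashboard_html([graph_url(["servers.web01.loadavg.01"])])` (a made-up metric path) returns a page you can save as a static file and share.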
One Machine One Day! • A single low-end machine should have capacity for a few thousand metrics per minute from 50+ machines. • Graphite is not CPU intensive, but needs fast disks and/or more memory.
One Day, One Person • Graphite is not hard to install, but it is a bit messy. • But might be as easy as "apt-get install graphite" on your system. • It would be good to have a workshop or prebuilt AMI for EC2 • But not today :-(
Operational Stats • You could parse /proc, ps, df, netstat, etc. and write your own custom scripts.... • ...or use Diamond from Brightcove • https://github.com/BrightcoveOS/Diamond
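For a sense of what the "write your own custom scripts" route involves, here is a rough sketch that reads the load average and sends it to Graphite over Carbon's plaintext protocol (one `path value timestamp` line per metric, TCP port 2003 by default). The host name and the `servers.web01` prefix are assumptions for illustration. Diamond does this kind of collection for many subsystems so that you don't have to.

```python
import os
import socket
import time

CARBON_HOST = "graphite.example.com"  # hypothetical host
CARBON_PORT = 2003                    # Carbon's default plaintext port


def format_metric(path, value, timestamp=None):
    """Format one metric as a Carbon plaintext line: 'path value timestamp'."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def load_average_lines(prefix, timestamp=None):
    """Collect the 1, 5 and 15 minute load averages (Unix only)."""
    names = ("01", "05", "15")
    return [
        format_metric(f"{prefix}.loadavg.{name}", value, timestamp)
        for name, value in zip(names, os.getloadavg())
    ]


def send_lines(lines, host=CARBON_HOST, port=CARBON_PORT):
    """Send metric lines to Carbon over a short-lived TCP connection."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))
```

Run from cron once a minute, `send_lines(load_average_lines("servers.web01"))` would produce three series in Graphite.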
Metrics in Diamond now • Memory • Apache • CPU • NGINX • Disk • MySQL • Network • SNMP and many more
100% of pure operational metrics are now shared! But what about your applications? And business metrics?
Enter StatsD • https://github.com/etsy/statsd • Your application sends event data to StatsD, as it happens, in real-time. • StatsD collects this data and computes time-series metrics (sum, min, max, average) • At each flush interval (10 seconds by default, configurable), it writes data to Graphite
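The collect-then-flush idea can be sketched in a few lines. This is an illustration of the aggregation step only, not StatsD's actual implementation (StatsD is a Node.js daemon); the class and method names are made up for this example.

```python
from collections import defaultdict


class MiniAggregator:
    """Toy model of StatsD: buffer raw events, summarize them on flush."""

    def __init__(self):
        self.values = defaultdict(list)

    def record(self, name, value):
        """Store one raw event value under its metric name."""
        self.values[name].append(value)

    def flush(self):
        """Summarize everything seen since the last flush, then reset."""
        summary = {
            name: {
                "sum": sum(vals),
                "min": min(vals),
                "max": max(vals),
                "average": sum(vals) / len(vals),
            }
            for name, vals in self.values.items()
        }
        self.values.clear()
        return summary
```

Three servers each recording `("login", 1)` plus one deploy script recording `("deploy", 1)` flush as a `login` sum of 3 and a `deploy` sum of 1. Graphite then stores one data point per metric per flush.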
The Magic of UDP • Your application sends metrics in a UDP packet. • UDP is fire-and-forget: the sender gets no errors back, no exceptions, no timeouts. It cannot cause your application to crash. • It will not overload your network. • You may lose metrics, but in an intranet, it's rare.
Let's Count Logins! • Most StatsD client APIs are one-file, no C, simple. • Add one line to your login code. StatsD::increment('logins'); • That's it!
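To show how little a client needs, here is a minimal Python sketch of a counter increment. It uses the StatsD wire format `name:value|c` and StatsD's default UDP port 8125; the host, the function names and the defensive `try`/`except` are choices made for this example, not a published client API.

```python
import socket

STATSD_HOST = "127.0.0.1"  # assumption: StatsD runs on the local machine
STATSD_PORT = 8125         # StatsD's default UDP port

# One shared datagram socket; UDP needs no connection setup.
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def counter_payload(name, amount=1):
    """Encode a counter update in the StatsD wire format: 'name:amount|c'."""
    return f"{name}:{amount}|c".encode("ascii")


def increment(name, amount=1, host=STATSD_HOST, port=STATSD_PORT):
    """Fire-and-forget counter increment; never raises into the caller."""
    try:
        _sock.sendto(counter_payload(name, amount), (host, port))
    except OSError:
        pass  # a metrics failure must not break the login path
```

In the login handler, the one added line is `increment("logins")`.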
Events! • You can also graph low-frequency events. • Just send another StatsD request in your batch script StatsD::increment("deploy", 1); • Do it on reboots, installs, core dumps. • New bugs, new hires, new code commits. • Use drawAsInfinite to display
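As a sketch of the display side, the URL below overlays deploy events on a login graph. `drawAsInfinite` is a Graphite render function that draws each non-zero point as a vertical line. The host and the metric paths `stats.logins` and `stats.deploy` are assumptions, since the exact path depends on how StatsD's namespacing is configured.

```python
from urllib.parse import urlencode

# Hypothetical Graphite host; replace with your own server.
GRAPHITE_BASE = "http://graphite.example.com"


def overlay_events_url(metric, event_metric, time_from="-7d"):
    """Build a render URL that draws event_metric as vertical lines over metric."""
    params = [
        ("target", metric),
        ("target", f"drawAsInfinite({event_metric})"),
        ("from", time_from),
    ]
    return GRAPHITE_BASE + "/render?" + urlencode(params)
```

`overlay_events_url("stats.logins", "stats.deploy")` gives a graph where each deploy is a vertical marker, so a drop in logins right after a release stands out.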
[Diagram: three servers each send login,1 to StatsD, and a deploy script sends deploy,1; StatsD aggregates these and writes (login,3) and (deploy,1) to Graphite.]