Ammon Sutherland, April 23, 2013
Preface... "Who is it?" said Arthur. "Well," said Ford, "if we're lucky it's just the Vogons come to throw us into space." "And if we're unlucky?" "If we're unlucky," said Ford grimly, "the captain might be serious in his threat that he's going to read us some of his poetry first ..."
Background 3 • Long-time Linux System Administrator turned DBA – University systems – Managed Hosting – Online Auctions – E-commerce, SEO, marketing, data-mining A bit of an optimization junkie… Once in a while I share: http://shamallu.blogspot.com/
5 Basic Theory deadlock detected we roll back transaction two err one two one three - A MySQL Haiku -
Directory Structure 6 Things that must be stored on disk • Data files (.ibd or .MYD and .MYI) – Random IO • Main InnoDB data file (ibdata1) – Random IO • InnoDB Log files (ib_logfile0, ib_logfile1) – Sequential IO (one at a time) • Binary logs and relay logs – Sequential IO • General query log and Slow query log – Sequential IO • Master.info – technically Random IO • Error log – Infrequent Sequential IO
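The file placement above can be sketched as a hypothetical my.cnf fragment that splits the sequential-IO files away from the random-IO data files; all paths and the two-device split are illustrative assumptions, not from the original slides:

```ini
[mysqld]
# Random IO: data files and the main InnoDB tablespace
datadir                   = /data/mysql            # .ibd / .MYD / .MYI, ibdata1

# Sequential IO: redo logs and binary/relay logs can live on a separate
# spindle so sequential writes don't seek against the data-file IO
innodb_log_group_home_dir = /logs/mysql            # ib_logfile0, ib_logfile1
log-bin                   = /logs/mysql/mysql-bin  # binary logs
relay-log                 = /logs/mysql/relay-bin  # relay logs

# Infrequent / low-volume sequential IO
log-error                 = /logs/mysql/error.log
slow_query_log_file       = /logs/mysql/slow.log
```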
Linux IO Sub-System 7
Hard Drives 8 • Rotating platters • SAS vs. SATA – SAS 6 Gb/s connectors can handle SATA 3 Gb/s drives – SAS typically costs more (much more at larger sizes) – SAS drives often run at higher rpm (10k, 15k rpm) – SAS has more logic on the drives – SAS has more data consistency and error reporting logic vs. SATA S.M.A.R.T. – SAS uses higher voltages, allowing for external arrays with longer signal runs – SAS does TCQ vs. SATA NCQ (provides a similar effect) – Both use 8b/10b encoding (25% parity overhead)
SSD 9 • Pros: – Very fast random reads and writes – Handle high concurrency very well • Cons: – Cost per GB – Lifespan and performance depend on write-cycles. Beware write amplification – Requires care with RAID cards
RAID 10 Typical RAID Modes: • RAID-0: Data striped, no redundancy (2+ disks) • RAID-1: Data mirrored, 1:1 redundancy (2+ disks) • RAID-5: Data striped with parity (3+ disks) • RAID-6: Data striped with double parity (4+ disks) • RAID-10: Data striped and mirrored (4+ disks) • RAID-50: RAID-0 striping of multiple RAID-5 groups (6+ disks)
RAID (cont.) 11 Typical RAID benefits and risks: • RAID-0 - Scales reads and writes, multiplies space (risky, no disk can fail) • RAID-1 - Scales reads, not writes, no additional space gain (data intact with only one disk remaining, and can rebuild) • RAID-5 - Scales reads and some writes (parity penalty; can survive one disk failure and rebuild) • RAID-6 - Scales reads and fewer writes than RAID-5 (double parity penalty; can survive two disk failures and rebuild) • RAID-10 - Scales reads 2x vs. writes (can lose up to two disks in particular combinations) • RAID-50 - Scales reads and writes (can lose one disk per RAID-5 group and still rebuild)
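The space trade-offs above can be checked with a small arithmetic sketch (a simplified model of my own assuming n identical disks; nested levels like RAID-50 are omitted):

```python
# Sketch: usable capacity per RAID level, assuming n identical disks.
def raid_usable(level, n, disk_gb):
    if level == 0:
        return n * disk_gb          # striping only, no redundancy
    if level == 1:
        return disk_gb              # full mirror: one disk's worth of space
    if level == 5:
        return (n - 1) * disk_gb    # one disk's worth lost to parity
    if level == 6:
        return (n - 2) * disk_gb    # two disks' worth lost to parity
    if level == 10:
        return n // 2 * disk_gb     # mirrored pairs: half the raw space
    raise ValueError(f"unsupported RAID level: {level}")

# The 4 x 300 GB array from the benchmark slides later in this deck:
print(raid_usable(5, 4, 300))   # RAID-5  -> 900 GB usable
print(raid_usable(10, 4, 300))  # RAID-10 -> 600 GB usable
```

Note how RAID-6 on four disks gives the same usable space as RAID-10, but with the double-parity write penalty described above.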
RAID Cards 12 • Purpose: – Offload RAID calculations from CPU, including parity – Routine disk consistency checks – Cache • Tips: – Controller cache is best used mostly for writes – Write-back cache is good - beware of “learn cycles” – Disk cache - best disabled on SAS drives; SATA drives frequently use it for NCQ – Stripe size - should be at least the size of the basic block being accessed; bigger is usually better for larger files – Read ahead - depends on access patterns
LVM 13 Why use it? • Ability to easily expand disk • Snapshots (easy for dev, proof of concept, backups) Cost? • Straight usage usually carries a 2-3% performance penalty • With 1 snapshot, a 40-80% penalty • Additional snapshots are only a 1-2% additional penalty each
IO Scheduler 14 Goal - minimize seeks, prioritize process IO • CFQ - multiple queues, priorities, sync and async • Anticipatory - anticipatory pauses after reads, not useful with RAID or TCQ • Deadline - "deadline" contract for starting all requests, best with many-disk RAID or TCQ • Noop - tries not to interfere, simple FIFO, recommended for VMs and SSDs
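The active scheduler for a device is exposed through sysfs as a space-separated list with the current choice in brackets. A minimal sketch of reading it (the parsing helper is mine; the sysfs path and bracket format are standard Linux):

```python
# Sketch: /sys/block/<dev>/queue/scheduler reads like
# "noop anticipatory deadline [cfq]" -- the bracketed entry is active.
def active_scheduler(sysfs_line):
    """Return the bracketed (active) scheduler name from the sysfs string."""
    start = sysfs_line.index("[") + 1
    return sysfs_line[start:sysfs_line.index("]")]

# Reading the real file would look like this (requires the device to exist):
# with open("/sys/block/sda/queue/scheduler") as f:
#     print(active_scheduler(f.read()))
# Switching is a write to the same file, e.g.:
#     echo deadline > /sys/block/sda/queue/scheduler

print(active_scheduler("noop anticipatory deadline [cfq]"))  # -> cfq
```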
Filesystem Concepts 15 • Inode - stores block pointers and metadata of a file or directory • Block - stores data • Superblock - stores filesystem metadata • Extent - contiguous "chunk" of free blocks • Journal - record of pending and completed writes • Barrier - safety mechanism when dealing with RAID or disk caches • fsck - filesystem check
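The inode/block split above is visible from userspace via stat(). A small sketch using a throwaway temp file:

```python
import os
import tempfile

# Every file's inode metadata is visible through stat(); the data itself
# lives in blocks. st_blocks counts 512-byte units regardless of the
# filesystem's actual block size.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 8192)   # 8 KiB of file data
os.close(fd)

st = os.stat(path)
print("inode number:", st.st_ino)       # which inode holds the metadata
print("size (bytes):", st.st_size)      # logical file size
print("blocks (512B units):", st.st_blocks)  # space actually allocated
os.remove(path)
```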
VFS Layer 16 • API layer between system calls and filesystems, similar to MySQL storage engine API layer
Linux IO Sub-System 17
18 Filesystem Choices In the style of Edgar Allan Poe’s “The Raven”… Once upon a SQL query While I joked with Apple's Siri Formatting many a logical volume on my quad core Suddenly there came an alert by email as of some threshold starting to wail wailing like my SMS tone "Tis just Nagios" I muttered, "sending alerts unto my phone, Only this - I might have known."
Ext filesystems 19 • ext2 - no journal • ext3 - adds journal, some enhancements like directory hashes, online resizing • ext4 - adds extents, barriers, journal checksum, removes inode locking • common features - block groups, reserved blocks • ext2/3 max FS size=32 TiB, max file size=2 TiB • ext4 max FS size=1 EiB, max file size=16 TiB
XFS 20 • extents, data=writeback style journaling, barriers, delayed allocation, dynamic inode creation, online growth, cannot be shrunk • max FS size=16 EiB, max file size=8 EiB
Btrfs 21 • extents, data and metadata checksums, compression, subvolumes, snapshots, online b-tree rebalancing and defrag, SSD TRIM support • max FS size=16 EiB, max file size=16 EiB
ZFS* 22 • volume management, RAID-Z, continuous integrity checking, extents, data and metadata checksums, compression, subvolumes, snapshots, encryption, ARC cache, transactional writes, deduplication • max FS size=16 EiB, max file size=16 EiB • * note that not all these features are yet supported natively on Linux
24 MySQL Tuning Options Continuing in the style of “The Raven”… Ah distinctly I remember as I documented for each member of the team just last Movember in the wiki that we keep write and keep and nothing more… When my query thus completed Fourteen duplicate rows deleted All my replicas then repeated repeated the changes as before I dumped it all to a shared disk kept as a backup forever more.
InnoDB Flush Method 26 • Applies to InnoDB log and data file writes • O_DIRECT - “Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers.” - Applies to log and data files, follows up with fsync, eliminates need for doublewrite buffer • O_DSYNC - “Write I/O operations on the file descriptor shall complete as defined by synchronized I/O data integrity completion.” - Applies to log files; data files get fsync • fdatasync - (deprecated option in 5.6) Default mode. fdatasync on every write to log or data files • O_DIRECT_NO_FSYNC - (5.6 only) O_DIRECT without fsync (not suitable for XFS) • fsync - flush all data and metadata for a file to disk before returning • fdatasync - flush all data and only the metadata necessary to read the file properly to disk before returning
InnoDB Flush Method - Notes 27 • O_DIRECT - “The thing that has always disturbed me about O_DIRECT is that the whole interface is just stupid, and was probably designed by a deranged monkey on some serious mind-controlling substances.” --Linus Torvalds • O_DIRECT - “The behaviour of O_DIRECT with NFS will differ from local file systems. Older kernels, or kernels configured in certain ways, may not support this combination. The NFS protocol does not support passing the flag to the server, so O_DIRECT I/O will only bypass the page cache on the client; the server may still cache the I/O. The client asks the server to make the I/O synchronous to preserve the synchronous semantics of O_DIRECT. Some servers will perform poorly under these circumstances, especially if the I/O size is small. Some servers may also be configured to lie to clients about the I/O having reached stable storage; this will avoid the performance penalty at some risk to data integrity in the event of server power failure. The Linux NFS client places no alignment restrictions on O_DIRECT I/O.” • O_DSYNC - “POSIX provides for three different variants of synchronized I/O, corresponding to the flags O_SYNC, O_DSYNC, and O_RSYNC. Currently (2.6.31), Linux only implements O_SYNC, but glibc maps O_DSYNC and O_RSYNC to the same numerical value as O_SYNC. Most Linux file systems don't actually implement the POSIX O_SYNC semantics, which require all metadata updates of a write to be on disk on returning to user space, but only the O_DSYNC semantics, which require only actual file data and metadata necessary to retrieve it to be on disk by the time the system call returns.”
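The fsync/fdatasync distinction that these options build on can be demonstrated directly from Python (a minimal sketch; os.O_DIRECT also exists on Linux, but its buffer-alignment requirements make it awkward from Python, so it is shown only as a comment):

```python
import os
import tempfile

# fsync() flushes file data AND all metadata; fdatasync() skips metadata
# (e.g. mtime) that isn't needed to read the data back, often saving an
# extra journal write per flush.
fd, path = tempfile.mkstemp()
os.write(fd, b"COMMIT;\n")

os.fdatasync(fd)   # data + minimal metadata durable (the fdatasync mode)
os.fsync(fd)       # data + all metadata durable (what O_DIRECT mode
                   # follows up its direct writes with)

# O_DIRECT is requested at open() time instead, e.g. (Linux-only,
# requires sector-aligned buffers and offsets):
#   fd = os.open(path, os.O_RDWR | os.O_DIRECT)
os.close(fd)

with open(path, "rb") as f:
    data = f.read()
os.remove(path)
print(data)  # -> b'COMMIT;\n'
```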
28 Benchmarks There once was a small database program
It had InnoDB and MyISAM
One did transactions well,
and one would crash like hell
Between the two they used all of my RAM - A database Limerick -
Testing Setup... 29 • Dell PowerEdge 1950 – 2x Quad-core Intel Xeon 5150 @ 2.66 GHz – 16 GB RAM – 4 x 300 GB SAS disks at 10k rpm (RAID-5, 64KB stripe size) – Dell PERC 6/i RAID Controller with 512MB cache – CentOS 6.4 (sysbench io tests done with Ubuntu 12.10) – MySQL 5.5.30
Mount Options 35 ext2: noatime ext3: noatime ext4: noatime,barrier=0 xfs: inode64,nobarrier,noatime,logbufs=8 btrfs: noatime,nodatacow,space_cache zfs: noatime (recordsize=16k, compression=off, dedup=off) all - noatime - Do not update access times (atime) metadata on files after reading or writing them ext4 / xfs - barrier=0 / nobarrier - Do not use barriers to pause and receive assurance when writing (aka, trust the hardware) xfs - inode64 - use 64-bit inode numbering - became default in most recent kernel trees xfs - logbufs=8 - Number of in-memory log buffers (between 2 and 8, inclusive) btrfs - space_cache - Btrfs stores the free space data on disk to make the caching of a block group much quicker (Kernel 2.6.37+). It's a persistent change and is safe to boot into old kernels btrfs - nodatacow - Do not copy-on-write data. datacow is used to ensure the user either has access to the old version of a file, or to the newer version of the file. datacow makes sure we never have partially updated files written to disk. nodatacow gives a slight performance boost by directly overwriting data (like ext), at the expense of potentially getting partially updated files on system failures. Performance gain is usually < 5% unless the workload is random writes to large database files, where the difference can become very large btrfs - compress=zlib - Better compression ratio. It's the default and safe for older kernels btrfs - compress=lzo - Fastest compression. btrfs-progs 0.19 or older will fail with this option. The default in kernel 2.6.39 and newer
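Expressed as hypothetical /etc/fstab entries using the options above (device names and mount points are placeholders, not from the benchmark machine):

```
# /etc/fstab sketch - data volume and log volume with the tuned options
/dev/sdb1  /var/lib/mysql   ext4  noatime,barrier=0                    0 0
/dev/sdc1  /var/log/mysql   xfs   inode64,nobarrier,noatime,logbufs=8  0 0
```

As the slide notes, barrier=0/nobarrier trades crash safety for speed and should only be used when a battery-backed controller cache can be trusted.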
iobench with mount options [Chart: read and write throughput in MB/s (higher is better) for ext2, ext3, ext4, xfs, and btrfs, each with default and with the tuned mount options]
37 IO Scheduler Choices Round and round the disk drive spins but SSD sits still and grins. It is randomly fast for data current and past. My database upgrade begins
OLTP Performance - 1 thread 44 [Chart: runtime in seconds (lower is better) for each InnoDB flush method (fdatasync, O_DIRECT, O_DSYNC, NFS) crossed with filesystem (ext2, ext3, ext4, xfs, btrfs, zfs), at buffer pool sizes of 1/4 and 7/8 of RAM]
OLTP Performance - 16 threads 45 [Chart: runtime in seconds (lower is better) for the same flush method and filesystem combinations as the 1-thread test, at 16 threads with buffer pool sizes of 1/4 and 7/8 of RAM]
46 AWS Cloud Options Performance, uptime, Consistency and scale-up: No, this is a cloud… - A haiku on clouds -
Cloud Performance 47 • EC2 - Slightly unpredictable • *Note: not my research or graphs. See blog.scalyr.com for benchmarks and writeup
48 Conclusions Oracle is Red, IBM is Blue, I like stuff for free MySQL will do.
Conclusions 49 IO Schedulers - Deadline or Noop Filesystem - Ext3 is usually slowest. Btrfs is not quite there yet but looking better. Linux zfs is cool, but performance is sub-par. InnoDB Flush Method - O_DIRECT is not always best Filesystem mount options make a difference Artificial benchmarks are fun, but like most things, comparative speed is very workload dependent
Parting thought Do you like MyISAM? I do not like it, Sam-I-am. I do not like MyISAM. Would you use it here or there? I would not use it here or there. I would not use it anywhere. I do not like MyISAM. I do not like it, Sam-I-am. Would you like it in an e-commerce site? Would you like it in the middle of the night? I do not like it for an e-commerce site. I do not like it in the middle of the night. I would not use it here or there. I would not use it anywhere. I do not like MyISAM. I do not like it, Sam-I-am. Would you could you for foreign keys? Use it, use it, just use it please! You may like it, you will see Just convert these tables three… Not for foreign keys, not for those tables three! I will not use it, you let me be!