How Much Data? • Peak time (10:00~10:30): • IN: 15k-20k msg per second • OUT: 30k-40k msg per second
Apps depend on Kafka
real-time PV/UV (page views / unique visitors)
Data loading uses Kafka
Replace RabbitMQ
• RabbitMQ vs. Kafka:
  • Servers: 6 vs. 1
  • Load: >10 vs. <2.5
  • Language: Erlang vs. Scala
  • Deployment: Difficult vs. Easy
  • Clients: A lot vs. Not many
  • Management: Web console vs. JMX
WHY is Kafka 'fast'?
Basics • producers • consumers • consumer groups • brokers
Major Design Elements • Persistent messages • Throughput >>> features • Consumers hold state • Everything is distributed
Detail Agenda
• Maximizing Performance
  • Filesystem vs. Memory
  • BTree?
  • Zero-copy
  • End-to-end Batch Compression
• Consumer State
  • Message delivery semantics
  • Push vs. Pull
• Message
  • Message format
  • Disk structure
• Zookeeper
• Directory Structure
Maximize Performance: Filesystem vs. Memory
Which is faster?
Disk hardware (6 × 7200rpm SATA, RAID-5):
• linear writes: 300 MB/sec
• random writes: 50k/sec
Let’s see something REAL
page cache • the OS uses free memory for disk caching, which makes random writes fast
Drawbacks • All disk reads and writes will go through this unified cache. This feature cannot easily be turned off without using direct I/O, so even if a process maintains an in-process cache of the data, this data will likely be duplicated in OS pagecache, effectively storing everything twice.
If we use memory(JVM) • The memory overhead of objects is very high, often doubling the size of the data stored (or worse). • Java garbage collection becomes increasingly sketchy and expensive as the in-heap data increases.
cache size • Relying on pagecache at least doubles the available cache (automatic access to all free memory), and likely doubles it again by storing a compact byte structure rather than individual objects: up to a 28-30GB cache on a 32GB machine.
Comparison: pagecache (on-disk) vs. in-memory (JVM)
• GC: no GC vs. stop-the-world pauses
• Initialization: stays warm even if the process is rebuilt vs. slow (~10 min to reload 10GB after a restart) with a cold cache
• Cache logic: handled by the OS vs. handled by the program
Conclusion • Using the filesystem and relying on pagecache is superior to maintaining an in-memory cache or other structure
Go Extreme! • Write to the filesystem DIRECTLY! • (In effect this just means the data is transferred into the kernel's pagecache, where the OS can flush it later.)
Furthermore • You can configure a flush every N messages or every M seconds, to put a bound on the amount of data "at risk" in the event of a hard crash. • Varnish uses a pagecache-centric design as well.
Maximize Performance: BTree?
Background • Messaging-system metadata is often kept in a BTree. • BTree operations are O(log N).
BTree • O(log N) is usually treated as roughly constant time
But a BTree is slow on disk!
BTree on Disk • Disk seeks come at 10 ms a pop • each disk can do only one seek at a time, so parallelism is limited • the observed performance of tree structures is often super-linear as data grows
Lock • Page or row locking is needed to avoid locking the whole tree on every operation
Two Facts • no advantage from improvements in drive density, because of the heavy reliance on disk seeks • small (<100GB), high-RPM SAS drives are needed to maintain a sane ratio of data to seek capacity
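A back-of-the-envelope sketch of why seeks dominate, using the seek time and linear throughput figures from the slides above (the 10,000-read workload is a made-up illustration):

```python
SEEK_TIME_S = 0.010   # ~10 ms per seek (figure from the slide)
LINEAR_MB_S = 300     # linear throughput of the 6-disk SATA RAID (figure from the slide)

# Hypothetical workload: fetch 10,000 random 4 KB pages.
random_reads = 10_000
seek_bound_s = random_reads * SEEK_TIME_S             # the disk does one seek at a time
linear_s = (random_reads * 4 / 1024) / LINEAR_MB_S    # same ~39 MB read sequentially

print(f"seek-bound: {seek_bound_s:.0f}s, sequential: {linear_s:.2f}s")
# prints "seek-bound: 100s, sequential: 0.13s"
```

Roughly three orders of magnitude apart, which is why the next slides abandon seek-heavy tree structures.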
Use a log-file structure instead!
Features • One queue is one log file • All operations are O(1) • Reads do not block writes or each other • Performance is decoupled from data size • Messages are retained after consumption
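The properties above can be sketched as a minimal append-only log, one file per queue. The length-prefixed framing and class names here are illustrative, not Kafka's actual on-disk format:

```python
import tempfile

class LogQueue:
    """One queue = one append-only log file; append and indexed read are O(1),
    reads never block writes, and messages survive consumption."""

    def __init__(self, path):
        self._path = path
        self._f = open(path, "wb")
        self._index = []   # in-memory index: byte offset of each message
        self._end = 0      # current end-of-log byte position

    def append(self, payload: bytes):
        # O(1): write a 4-byte length prefix plus the payload at the tail.
        record = len(payload).to_bytes(4, "big") + payload
        self._index.append(self._end)
        self._f.write(record)
        self._f.flush()
        self._end += len(record)

    def read(self, i: int) -> bytes:
        # Reading is one seek + read, and does NOT remove the message.
        with open(self._path, "rb") as f:
            f.seek(self._index[i])
            n = int.from_bytes(f.read(4), "big")
            return f.read(n)

log = LogQueue(tempfile.NamedTemporaryFile(delete=False).name)
log.append(b"first")
log.append(b"second")
```

Note that a second `read(0)` returns the same message again: consumption never mutates the log, which is what makes "rewind" possible later in the deck.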
Maximize Performance: Zero-copy
Without zero-copy, the common path from file to socket takes four copies:
1. The operating system reads data from the disk into pagecache in kernel space
2. The application reads the data from kernel space into a user-space buffer
3. The application writes the data back into kernel space into a socket buffer
4. The operating system copies the data from the socket buffer to the NIC buffer, where it is sent over the network
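This four-copy path is what `sendfile()` eliminates. A minimal sketch on Linux using `os.sendfile`, pushing a file to a socket so the payload never passes through user space (the file contents and socket pair are illustrative stand-ins for a log segment and a consumer connection):

```python
import os
import socket
import tempfile

payload = b"message-set " * 500
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

# A connected socket pair stands in for a real consumer connection.
server, client = socket.socketpair()

fd = os.open(path, os.O_RDONLY)
sent = 0
while sent < len(payload):
    # The kernel moves pagecache bytes straight to the socket buffer;
    # no user-space buffer is involved.
    sent += os.sendfile(server.fileno(), fd, sent, len(payload) - sent)
os.close(fd)
server.close()

received = b""
while len(received) < len(payload):
    received += client.recv(65536)
client.close()
os.unlink(path)
```

With many consumers of the same topic, the segment sits in pagecache once and each consumer's fetch is served by another `sendfile` from the same cached pages.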
zero-copy • data is copied into pagecache exactly once and reused on each consumption, instead of being stored in memory and copied out to kernel space every time it is read
End-to-end Batch Compression • Key points: • End-to-end: compressed by producers and decompressed by consumers • Batch: compression is applied to a whole 'message set' • Kafka supports the GZIP and Snappy codecs
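Why compress whole message sets rather than individual messages: each small message pays the codec's fixed header overhead, and the codec cannot exploit redundancy across messages. A quick illustration with Python's gzip (the messages themselves are made up):

```python
import gzip

# 200 small, similar messages, shaped like a producer batch.
messages = [f"user-{i} viewed /home at t={i}".encode() for i in range(200)]

# Compress each message on its own vs. the whole set as one blob.
per_message = sum(len(gzip.compress(m)) for m in messages)
batched = len(gzip.compress(b"".join(messages)))

print(per_message, batched)  # the batched form is dramatically smaller
```

Because the batch stays compressed from producer through broker to consumer (end-to-end), the broker also ships fewer bytes through pagecache and the network.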
Facts • No per-message ACKs from consumers to the broker • Consumers maintain their own message state
Features • A message lives in a partition • Messages are stored and handed out in the order they arrive • what other systems call a 'watermark' is the 'offset' in Kafka
track state • consumers write message state (the offset) to ZooKeeper • or store the offset in the same transaction as the processed data • side benefit: messages can be 'rewound' and re-consumed
Consumer State: Push vs. Pull
push system • what happens if a consumer is defunct (slow or dead)?
Kafka uses the pull model
Message Format & Data structure
Msg Format • An N-byte message:
• If the magic byte is 0:
  1. 1-byte "magic" identifier to allow format changes
  2. 4-byte CRC32 of the payload
  3. (N − 5)-byte payload
• If the magic byte is 1:
  1. 1-byte "magic" identifier to allow format changes
  2. 1-byte "attributes" identifier to allow annotations on the message independent of the version (e.g. compression enabled, type of codec used)
  3. 4-byte CRC32 of the payload
  4. (N − 6)-byte payload
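A sketch of encoding and decoding the magic=1 layout above with Python's `struct` and `zlib.crc32`. The field order follows the slide; the big-endian byte order and function names are assumptions for illustration:

```python
import struct
import zlib

def encode_v1(payload: bytes, attributes: int = 0) -> bytes:
    """Pack magic(1) + attributes(1) + crc32(4) + payload, i.e. N = 6 + len(payload)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack(">BBI", 1, attributes, crc) + payload

def decode_v1(msg: bytes) -> bytes:
    """Unpack the header, verify magic and CRC, and return the payload."""
    magic, attributes, crc = struct.unpack(">BBI", msg[:6])
    payload = msg[6:]
    if magic != 1 or zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("bad magic byte or CRC mismatch")
    return payload

msg = encode_v1(b"hello")
```

The CRC lets a consumer detect corruption before trusting a payload, and the magic byte is what allows the two header layouts on the slide to coexist in one log.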
Log format on-disk • On-disk format of a message: • message length: 4 bytes (value: 1 + 4 + n) • 'magic' value: 1 byte • crc: 4 bytes • payload: n bytes • partition id and node id, together with the offset, uniquely identify a message
Kafka Log Implementation
Writes • Append-only writes • Flush policy: • M: flush after M messages accumulate in the log • S: flush S seconds after the last flush • Durability guarantee: lose at most M messages or S seconds of data in the event of a system crash
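The M-messages / S-seconds flush rule can be sketched as a small policy object (the class and method names are hypothetical, not Kafka's):

```python
import time

class FlushPolicy:
    """Flush when M messages have accumulated or S seconds have passed since
    the last flush, bounding data at risk in a crash to M messages / S seconds."""

    def __init__(self, max_messages: int, max_seconds: float):
        self.max_messages = max_messages
        self.max_seconds = max_seconds
        self.unflushed = 0
        self.last_flush = time.monotonic()

    def record_append(self):
        self.unflushed += 1

    def should_flush(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        return (self.unflushed >= self.max_messages
                or now - self.last_flush >= self.max_seconds)

    def mark_flushed(self):
        self.unflushed = 0
        self.last_flush = time.monotonic()

policy = FlushPolicy(max_messages=3, max_seconds=60.0)
```

The broker would consult `should_flush()` after each append and call fsync plus `mark_flushed()` when it fires; either threshold alone triggers the flush.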
Buffered Reads • the read buffer size doubles automatically until the message fits • you can specify the max buffer size
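A sketch of the doubling read buffer for a length-prefixed message; the framing and names are illustrative:

```python
import io

def read_framed(f, initial_size=64, max_size=1 << 20):
    """Read one 4-byte-length-prefixed message, doubling the buffer
    until the whole message fits or max_size is exceeded."""
    size = initial_size
    while size <= max_size:
        f.seek(0)
        buf = f.read(size)
        n = int.from_bytes(buf[:4], "big")
        if 4 + n <= len(buf):
            return buf[4:4 + n]
        size *= 2  # message didn't fit: double the buffer and retry
    raise ValueError("message exceeds max buffer size")

body = b"x" * 300
framed = io.BytesIO(len(body).to_bytes(4, "big") + body)
```

Starting small keeps memory low for typical messages, while the max bound stops a single huge (or corrupt) length prefix from exhausting memory.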
Offset Search • Search steps: 1. locate the log segment file in which the data is stored 2. calculate the file-specific offset from the global offset value 3. read from that file offset • A simple binary search over an in-memory range index
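The three steps above can be sketched as a binary search over segment base offsets, where each segment file is named by the global offset it starts at (the base-offset values here are made up):

```python
import bisect

# Hypothetical segment files, each identified by the global offset it starts at.
segment_bases = [0, 1000, 2500, 4000]

def locate(global_offset: int):
    """Step 1: binary-search for the owning segment.
    Step 2: subtract its base to get the file-specific offset.
    Returns (segment base, offset within that segment file)."""
    if global_offset < segment_bases[0]:
        raise ValueError("offset out of range")
    i = bisect.bisect_right(segment_bases, global_offset) - 1
    return segment_bases[i], global_offset - segment_bases[i]
```

Step 3 is then a plain seek-and-read inside the chosen file, so the whole lookup costs O(log #segments) in memory plus one disk read.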
Features • The consumer can reset its offset • OffsetOutOfRangeException (a problem we ran into)
Deletes • Policy: delete segments older than N days, or keep at most N GB • Deleting while reading? • a copy-on-write style segment list implementation provides consistent views, so a binary search can proceed on an immutable static snapshot of the log segments
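The copy-on-write segment list can be sketched as publishing a fresh immutable snapshot on every delete, so an in-flight reader keeps a consistent view. This is a single-threaded illustration; real code would publish the snapshot through a volatile/atomic reference:

```python
class SegmentList:
    """Readers search an immutable snapshot; deletes swap in a new tuple,
    so reads never block on, or observe a partial, delete."""

    def __init__(self, segments):
        self._snapshot = tuple(segments)  # immutable snapshot

    def view(self):
        # A reader holds on to this tuple for the duration of its search.
        return self._snapshot

    def delete_oldest(self, n: int):
        # Copy-on-write: publish a new snapshot instead of mutating in place.
        self._snapshot = self._snapshot[n:]

segments = SegmentList([0, 1000, 2500, 4000])
reader_view = segments.view()   # a reader begins a binary search
segments.delete_oldest(2)       # retention kicks in concurrently
```

The reader's `reader_view` is untouched by the delete, while new readers see the trimmed list; the old tuple is garbage-collected once the last reader drops it.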