Spreading Recurring Jobs Out Over Time (with Rails + Sidekiq)
We had a problem. Our SaaS product was starting to show performance problems at certain times. This, of course, wasn’t the first time we’d had such a problem, but this time it was different. Our background jobs, which we run with Sidekiq, were creating enough load on the database to interfere with the front-end web requests to the product.

We rely heavily on API integrations with other products. Some of the integrations are done by “brute force” due to limitations of these APIs. We’re forced to run jobs at an interval that query the 3rd-party APIs on a user-by-user basis in order to check for changes in data. This particular API doesn’t have any “lightweight” endpoints to help us check for changes, so we need to look at a lot of data and work out the diff ourselves. This is what it looks like:
- Every 6th hour we spawn a Sidekiq job using the Sidekiq Scheduler add-on, which lets us specify when the job runs via CRON syntax.
- That job then queries our database to find accounts using this integration. It then dives deeper to find users within those accounts that are eligible to have their data sync’d against this API.
- For each eligible user, we then spawn another “child job”, which does the work of fetching data from the API and determining whether certain pieces of data should be brought into our system.
- We use Sidekiq batching to track all the child jobs and determine whether the entire batch was successful or not. Audits are made along the way.
This all works fine, but the performance problem becomes obvious. With enough accounts using this integration, we would suddenly have a massive queue of jobs to run at the top of every 6th hour. Our auto-scaling kicks in, booting up more Sidekiq workers. This helps us get through the jobs pretty quickly, but the many small database reads and writes for every child job start to add up when run together as fast as possible like this. Sometimes this would coincide with busy traffic, causing the database load to spike enough to cause slowdowns throughout the system. We were hurting ourselves, and on top of that, the 3rd party warned us that we were approaching rate limits and was unwilling to provide a lighter API for determining these changes.
The solution, for us, was quite simple: spread (or “fan”) out the child jobs to run across the 6-hour time window. Instead of bursts of jobs, we’d have a steady stream of jobs throughout the day.
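To make the arithmetic concrete, here is a minimal sketch in plain Ruby, with no Rails or Sidekiq involved. The helper name spread_delays and the job count are illustrative assumptions, not code from our application.

```ruby
# Hypothetical helper: evenly spaced start delays (in seconds) for `count`
# jobs across a window of `window_seconds`.
def spread_delays(count, window_seconds)
  return [] if count.zero?

  interval = window_seconds / count # integer division
  Array.new(count) { |index| index * interval }
end

six_hours = 6 * 60 * 60 # 21_600 seconds

delays = spread_delays(120, six_hours)
puts delays.first(3).inspect # => [0, 180, 360]
puts delays.last             # => 21420
```

With 120 jobs in a 6-hour window, one job starts every 3 minutes, and the last one starts 3 minutes before the next parent job is due.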
I decided to create a module that could be mixed into the Sidekiq workers that required this functionality. I went with the term “Fan Out” because that seems to be the cool way to express this (I’ve seen the term used in Elixir/Erlang circles, for example) and because saying “spread out” makes me feel funny.
This is what I ended up with:
    module FanOut
      extend ActiveSupport::Concern

      def fanout(jobs, cron_override = nil)
        if jobs.empty?
          interval = 0
        else
          interval = recurring_frequency_in_seconds(cron_override) / jobs.size
        end

        accumulator = 0
        jobs.each do |job|
          yield job, accumulator.seconds
          accumulator += interval
        end
      end

      private

      def recurring_frequency_in_seconds(cron_override = nil)
        if cron_override.present?
          cron = cron_override
        else
          yaml_path = Rails.root.join("config", "sidekiq.yml")
          schedule_yaml = YAML.safe_load(ERB.new(File.read(yaml_path.to_s)).result, [Symbol])
          schedule_yaml[:schedule].each do |entry|
            next unless entry["class"] == self.class.to_s
            cron = entry["cron"]
          end
        end

        if cron.present?
          parsed_cron = Fugit::Cron.parse(cron)
          return parsed_cron.rough_frequency
        else
          return 0
        end
      end
    end
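One edge case to keep in mind: assuming rough_frequency returns an Integer (my understanding of Fugit, but worth verifying), the division in fanout is integer division. If a batch ever holds more jobs than there are seconds in the window, the interval rounds down to 0 and every job is scheduled immediately. The plain-Ruby sketch below contrasts that with float division; both method names are hypothetical and exist only for illustration.

```ruby
# Mirrors the interval calculation in fanout (integer division).
def integer_interval(window_seconds, job_count)
  job_count.zero? ? 0 : window_seconds / job_count
end

# Hypothetical alternative: float division keeps sub-second spacing.
def float_interval(window_seconds, job_count)
  job_count.zero? ? 0.0 : window_seconds.to_f / job_count
end

window = 21_600 # six hours in seconds

puts integer_interval(window, 120)    # => 180
puts integer_interval(window, 30_000) # => 0, so nothing is spread out
puts float_interval(window, 30_000)   # => 0.72
```

For batch sizes well below the number of seconds in the window, the integer version loses at most a second of spacing per job, which is why it has been good enough for us.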
I added this to a file at app/workers/concerns/fan_out.rb. It can be used in its simplest form like so:
    class ScheduledWorkers::SmoothTest
      include Sidekicker
      include FanOut

      sidekiq_options queue: WorkerQueues::DEFAULT, retry: 1

      def perform
        things = (1..120).to_a
        fanout(things, "0 */6 * * *") do |thing, delay_in_seconds|
          ScheduledWorkers::SmoothTestSingle.perform_in(delay_in_seconds, thing)
        end
      end
    end
The FanOut module provides a new enumerator method called fanout. In the above example I’m using it explicitly by passing the CRON string 0 */6 * * * directly to it as the second argument. Now we can iterate over things just like using an each loop, but the CRON interval will be taken into account to provide a calculated delay_in_seconds value that can be used to delay the child Sidekiq jobs using perform_in.
For good measure, I also added a layer of magic that allows us to avoid passing in the CRON string altogether. We use Sidekiq Scheduler, which is configured using a schedule section in sidekiq.yml. By using some introspection on the class name of the worker, I can try to pull that CRON string out of the YAML config file automatically. Although I usually prefer more explicit code, this technique should help us keep things lined up. We have a set-up wherein each Sidekiq Scheduler worker’s CRON setting can be overridden by an environment variable. Our application runs as a variety of instances, where the frequency of these jobs differs based on client needs. Ideally, if we were to override the frequency at which one of these parent jobs runs, we would expect the fanout operation to adjust accordingly. This technique allows us to do so.
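The lookup can be sketched in isolation using only Ruby's standard library. Everything specific here is an assumption for illustration: the schedule layout (job name mapped to settings), the string keys, the SMOOTH_TEST_CRON environment variable, and the cron_for helper. Our real sidekiq.yml may be shaped differently.

```ruby
require "erb"
require "yaml"

# Hypothetical schedule: each entry maps a job name to its settings, and the
# CRON string can be overridden through an (assumed) environment variable.
schedule_yaml = <<~YAML
  schedule:
    smooth_test:
      cron: "<%= ENV.fetch('SMOOTH_TEST_CRON', '0 */6 * * *') %>"
      class: "ScheduledWorkers::SmoothTest"
    nightly_cleanup:
      cron: "0 3 * * *"
      class: "ScheduledWorkers::NightlyCleanup"
YAML

# Returns the CRON string configured for the given worker class name,
# or nil when the worker has no schedule entry.
def cron_for(worker_class_name, yaml_source)
  # Render ERB first so environment overrides are applied, then parse.
  config = YAML.safe_load(ERB.new(yaml_source).result)
  _job_name, settings = config.fetch("schedule").find do |_name, entry|
    entry["class"] == worker_class_name
  end
  settings && settings["cron"]
end

puts cron_for("ScheduledWorkers::SmoothTest", schedule_yaml)
```

Because the ERB is rendered before the YAML is parsed, changing the environment variable on one instance changes the CRON string that the lookup returns, and therefore the interval used to space out the child jobs.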