How to Import a Large Data Set Using Core Data

When importing large data sets, the primary goals are minimizing memory usage, minimizing the duration of the import, and keeping the work off the main thread as much as possible.

In this post, we are going to explore the performance differences between two commonly used Core Data stacks on a large data set. For the purpose of this post, we will define a large data set as 300,000 managed objects, each with two string attributes.

We are going to compare two stacks. In the first, the background context is a child of the main context, and the main context is connected to the persistent store coordinator (aka nested contexts). In the second, the background context and the main context are independent, and both are connected to the persistent store coordinator (aka sibling contexts).

For the import operation, it makes sense to use a background context because we want to keep the work off the main thread.

In the case of nested contexts, saving the background context merely pushes the changes to the parent context. The changes reach the persistent store only after save is also invoked on the parent context.
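The nested setup and its two-step save can be sketched roughly as follows. This is a minimal sketch, not the code from the measured project: the helper names makeNestedContexts, saveAndWait, and saveThroughParent are hypothetical, and the coordinator is assumed to be already configured with a store.

```swift
import CoreData

/// Builds the nested stack: coordinator <- main context <- background context.
/// `makeNestedContexts` is a hypothetical helper name, not a Core Data API.
func makeNestedContexts(coordinator: NSPersistentStoreCoordinator)
    -> (main: NSManagedObjectContext, background: NSManagedObjectContext) {
    let main = NSManagedObjectContext(concurrencyType: .mainQueueConcurrencyType)
    main.persistentStoreCoordinator = coordinator

    let background = NSManagedObjectContext(concurrencyType: .privateQueueConcurrencyType)
    background.parent = main  // a save on `background` only reaches `main`
    return (main, background)
}

/// Saves one context on its own queue and rethrows any save error to the caller.
func saveAndWait(_ context: NSManagedObjectContext) throws {
    var failure: Error?
    context.performAndWait {
        guard context.hasChanges else { return }
        do { try context.save() } catch { failure = error }
    }
    if let failure = failure { throw failure }
}

/// Child first, then parent: only the second save writes to the persistent store.
func saveThroughParent(background: NSManagedObjectContext,
                       main: NSManagedObjectContext) throws {
    try saveAndWait(background)
    try saveAndWait(main)
}
```

Note that performAndWait blocks the caller, which keeps the sketch short; a real import would more likely use perform so that the calling thread is not held up.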

The advantage of this stack is that inconsistencies between the main context and the persistent store cannot happen, because every object passes through the main context before reaching the store.

The disadvantage is that pushing to the parent context is a memory-expensive operation. Even if you save the contexts periodically and release the child context at the end, the parent context keeps growing in memory. Only a reset helps, but resetting the main context is something we want to avoid in practice.

The memory profiler attributes the extra memory to calls of an internal method, NSManagedObjectContext(_NSInternalAdditions) _changeIDsForManagedObjects:toIDs:.

Memory usage with nested contexts and periodic saves.

In the case of sibling contexts, saving the background context pushes objects directly to the store. You can choose whether or not to merge the changes into the main context. From a memory usage perspective, this is more appealing. The problem is that inconsistencies between the main context and the persistent store can happen (more on that later).
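A rough sketch of the sibling setup, with the optional merge step wired through the did-save notification, might look like this. The helper names makeSiblingContexts and startMerging are hypothetical, and the coordinator is assumed to be already configured with a store.

```swift
import CoreData

/// Builds the sibling stack: both contexts are attached to the coordinator directly.
/// `makeSiblingContexts` is a hypothetical helper name, not a Core Data API.
func makeSiblingContexts(coordinator: NSPersistentStoreCoordinator)
    -> (main: NSManagedObjectContext, background: NSManagedObjectContext) {
    let main = NSManagedObjectContext(concurrencyType: .mainQueueConcurrencyType)
    main.persistentStoreCoordinator = coordinator

    let background = NSManagedObjectContext(concurrencyType: .privateQueueConcurrencyType)
    background.persistentStoreCoordinator = coordinator
    return (main, background)
}

/// Optional step: forwards every save of `background` into `main`.
/// Keep the returned token and pass it to removeObserver(_:) to stop merging.
func startMerging(from background: NSManagedObjectContext,
                  into main: NSManagedObjectContext) -> NSObjectProtocol {
    return NotificationCenter.default.addObserver(
        forName: .NSManagedObjectContextDidSave,
        object: background,
        queue: nil
    ) { notification in
        // The notification arrives on the saving thread, so hop onto main's queue.
        main.performAndWait {
            main.mergeChanges(fromContextDidSave: notification)
        }
    }
}
```

performAndWait is used here so that the merge has finished by the time the save returns. If the main thread can ever be blocked waiting on the import, perform is the safer choice.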

Memory problems can also occur with sibling contexts if you defer the save until the end of the import. The context grows too large during the import and uses a lot of memory, as can be seen in the following picture.

You can improve this by saving and resetting the context periodically (e.g. every X items). Each reset releases the objects registered in the context, which prevents the memory from growing. Of course, you want to save the context before every reset so that you don’t lose any data. The following picture shows that the improvement is drastic.
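The periodic save-and-reset loop can be sketched as follows. This is a minimal illustration under stated assumptions: importRows is a hypothetical helper, the entity name "Item" and its attributes "name" and "detail" are placeholders, and the default batch size of 1,000 is an arbitrary value rather than a measured optimum.

```swift
import CoreData

/// Inserts `rows` into `context`, saving and resetting after every `batchSize` inserts.
/// "Item", "name", and "detail" are placeholder names for this sketch.
func importRows(_ rows: [(name: String, detail: String)],
                into context: NSManagedObjectContext,
                batchSize: Int = 1_000) throws {
    var failure: Error?
    context.performAndWait {
        do {
            for (index, row) in rows.enumerated() {
                let item = NSEntityDescription.insertNewObject(forEntityName: "Item",
                                                               into: context)
                item.setValue(row.name, forKey: "name")
                item.setValue(row.detail, forKey: "detail")

                if (index + 1) % batchSize == 0 {
                    try context.save()  // save first so that no data is lost
                    context.reset()     // then drop the registered objects to free memory
                }
            }
            // Flush the last, possibly partial, batch.
            if context.hasChanges {
                try context.save()
            }
            context.reset()
        } catch {
            failure = error
        }
    }
    if let failure = failure { throw failure }
}
```

Calling it with a sibling background context, for example try importRows(rows, into: backgroundContext), keeps the number of live managed objects bounded by the batch size.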

Results of measuring:

  • I. Nested contexts with one save and reset of background context and one save of main context.
  • II. Nested contexts with periodic saving and resetting of background context and periodic saving of main context.
  • III. Sibling contexts with one save and reset of background context and one merge to main context.
  • IV. Sibling contexts with periodic saving and resetting of background context and periodic merging to main context.

The measurements clearly show that the last method (IV) beats the others in all three aspects and is therefore the preferable choice for this kind of problem.

Caveat

When merging changes into the main context with sibling contexts, you have to be aware that the mergeChanges(fromContextDidSave:) method behaves differently on iOS 9 and iOS 10.

On iOS 9, it refreshes only the objects that are currently registered in the target context, while on iOS 10 it refreshes all of the updated objects from the source and target contexts.

Consider the following scenario, in which you might run into undesired behavior if you are targeting iOS 9.

  • 1. You instantiate an NSFetchedResultsController with an NSManagedObjectContext and an NSFetchRequest that has an NSPredicate specified.
  • 2. You bind it to a UITableView and display the data. Only entries that match the specified predicate are shown. So far so good.
  • 3. With another context, you change some entries in the persistent store that previously did not match the predicate, so that they match it after the change.
  • 4. You merge the changes into the NSFetchedResultsController’s context.
  • 5. The new entries won’t be displayed because they were not previously registered in the target context. This is definitely not the behavior we want, because the displayed data is now inconsistent with the data in the store.
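Steps 1 and 3 of the scenario above can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the helper names makeController and renameAllItems, the entity "Item", its "name" attribute, and the BEGINSWITH predicate. Binding to the table view and the merge itself are left out.

```swift
import CoreData

/// Step 1: a fetched results controller that only shows entries matching a predicate.
/// "Item", "name", and the predicate are placeholders for this sketch.
func makeController(context: NSManagedObjectContext) throws
    -> NSFetchedResultsController<NSManagedObject> {
    let request = NSFetchRequest<NSManagedObject>(entityName: "Item")
    request.predicate = NSPredicate(format: "name BEGINSWITH %@", "A")
    // A fetched results controller requires at least one sort descriptor.
    request.sortDescriptors = [NSSortDescriptor(key: "name", ascending: true)]

    let controller = NSFetchedResultsController(fetchRequest: request,
                                                managedObjectContext: context,
                                                sectionNameKeyPath: nil,
                                                cacheName: nil)
    try controller.performFetch()
    return controller
}

/// Step 3: from a different context, rename every entry so that entries which did
/// not match the predicate before may match it afterwards.
func renameAllItems(to newName: String, in other: NSManagedObjectContext) throws {
    var failure: Error?
    other.performAndWait {
        do {
            let request = NSFetchRequest<NSManagedObject>(entityName: "Item")
            for item in try other.fetch(request) {
                item.setValue(newName, forKey: "name")
            }
            try other.save()
        } catch {
            failure = error
        }
    }
    if let failure = failure { throw failure }
}
```

Entries renamed in step 3 were never fetched by the controller's context, so they are not among its registered objects, which is exactly the case the iOS 9 merge skips.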

This behavior was reported long ago but it was ignored until iOS 10.

There are a few ways to fix this on iOS 9.

Before the merge, you can call willAccessValue(forKey:) with a nil argument on the updated objects from the notification that are not yet registered in mainContext. That brings those objects from the store or cache into mainContext’s registered objects. You can achieve that with the following code.

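// Updated objects carried by the did-save notification; they belong to the saving context.
let updatedObjects = notification.userInfo?[NSUpdatedObjectsKey] as? Set<NSManagedObject> ?? []
// Compare by objectID: instances from different contexts are never equal to each other.
let registeredIDs = Set(mainContext.registeredObjects.map { $0.objectID })
for object in updatedObjects where !registeredIDs.contains(object.objectID) {
  // Registers the object in mainContext and fires its fault before the merge.
  let obj = mainContext.object(with: object.objectID)
  obj.willAccessValue(forKey: nil)
}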

After that, you can call mergeChanges(fromContextDidSave:) and it will do its job.

Another way is to merge the changes first and then manually refresh the updated objects that were not among mainContext’s registered objects. That is achieved with the following code.

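// Updated objects carried by the did-save notification; they belong to the saving context.
let updatedObjects = notification.userInfo?[NSUpdatedObjectsKey] as? Set<NSManagedObject> ?? []
// Compare by objectID: instances from different contexts are never equal to each other.
let registeredIDs = Set(mainContext.registeredObjects.map { $0.objectID })
for object in updatedObjects where !registeredIDs.contains(object.objectID) {
  // existingObject(with:) throws if the object is gone, so skip it instead of crashing.
  guard let obj = try? mainContext.existingObject(with: object.objectID) else { continue }
  mainContext.refresh(obj, mergeChanges: true)
}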

Interestingly, neither of these two approaches adds performance overhead.

Full source code is available on GitHub.