Best way to (re)load test data in MongoDB
We have a system where we have to create an exact copy of the original
database for testing. The database size is over 800GB. We have found
that zfs snapshots combined with zfs clone are the best option. For this
to work, journaling must be enabled in mongodb, so that a snapshot of
the running database can be recovered to a consistent state. Then you
can take a snapshot of the live database at any time, clone that
snapshot via "zfs clone", and finally start up a new mongodb instance on
the clone.
(We are using docker containers for this, but it doesn't really matter.)
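
The snapshot, clone, and start-up steps could be scripted roughly as
below. This is only a sketch: the dataset names, mount point, and port
are hypothetical, it assumes the dataset root is the dbpath of the
source instance, and the zfs commands normally need root privileges.

```python
import subprocess


def clone_commands(dataset, snapshot, clone, mountpoint, port):
    """Build the three commands: snapshot, clone, start mongod on the clone."""
    snap = f"{dataset}@{snapshot}"
    return [
        ["zfs", "snapshot", snap],
        # Mount the writable clone at a known path.
        ["zfs", "clone", "-o", f"mountpoint={mountpoint}", snap, clone],
        # Assumes the dataset root is the dbpath of the source instance.
        ["mongod", "--dbpath", mountpoint, "--port", str(port)],
    ]


def start_test_instance(commands):
    """Run the zfs steps to completion, then start mongod in the background."""
    *zfs_steps, mongod = commands
    for cmd in zfs_steps:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return subprocess.Popen(mongod)
```

For example, start_test_instance(clone_commands("tank/mongo", "t1",
"tank/mongo-test", "/mnt/mongo-test", 27018)) would leave a second
mongod listening on port 27018, backed by the clone.
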
Using zfs clones has a performance penalty (we measured about 80% of the
original performance), but it is still much better than restoring such a
huge database with mongoimport (about 8 hours for the import vs. 10
seconds for the clone). Using zfs for mongodb is not recommended anyway,
but you can keep a separate replica set member just for this purpose.
For much smaller databases, you can (of course) use pure python code to
insert test data into a test database. If it only takes seconds, then it
is not a problem, right? I believe that for small tests (e.g. unit
tests), using python code to populate a test database is just fine.
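
For the small-database case, a reset helper might look like the sketch
below. The function name and the fixture layout (a dict mapping
collection names to lists of documents) are assumptions for
illustration; db is expected to be a PyMongo Database object.

```python
def reset_database(db, fixtures):
    """Drop each fixture collection and reload it with one bulk insert.

    db       -- a pymongo Database (or any object with the same interface)
    fixtures -- dict mapping collection name to a list of documents
    """
    for name, docs in fixtures.items():
        db.drop_collection(name)
        if docs:  # insert_many rejects an empty list
            # Copy each document: insert_many adds a generated _id to
            # dicts that lack one, which would leak into the fixtures.
            db[name].insert_many([dict(doc) for doc in docs])
```

Calling this from a test's setUp with something like
MongoClient()["test_db"] (the database name is hypothetical) gives every
test case the same starting state.
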
There is no "best" way to do this, and there is always a tradeoff. For
good tests, you need more data. If you want to restore more data, then
it either takes more time, or you have to use external tools.
Regarding question #2, you can always set the _id of a document
explicitly when you insert it, for example as an ObjectId built from a
known 24-character hex string.
This will let you create a database with known _id values, and run
repeatable tests on that database.
> I'm toying around with Pyramid and am using Mongo via MongoEngine for
> storage. I'm new to both Pyramid and MongoEngine. For every test case in
> the part of my suite that tests the data access layer I want to reload
> the database from scratch, but it feels like there should be a better
> and faster way than what I'm doing now.
> I have two distinct issues:
> 1. What's the fastest way of resetting the database to a clean state?
> 2. How do I load data with Mongo's internal _id being kept persistent?
> For issue #1:
> First of all I'd very much prefer to avoid having to use external client
> programs such as mongoimport to keep the number of dependencies minimal.
> Thus if there's a good way to do it through MongoEngine or PyMongo,
> that'd be preferable.
> My first shot at populating the database was simply to load data from a
> JSON file, use this to create my model objects (based on
> MongoEngine.Document) and save them to the DB. With a single-digit
> number of test cases and very limited data, this approach already takes
> close to a second, so I'm thinking there should be a faster way. It's
> Mongo, after all, not Oracle.
> My second version uses the underlying PyMongo module's insert_many()
> function to add all the documents for each collection in one go, but for
> this small amount of data it doesn't seem any faster.
> Which brings us to issue #2:
> For both of these strategies I'm unable to insert the Mongo ObjectId
> type _id. I haven't made _id properties part of my models, because they
> seem a bit... alien. I'd rather not include them solely to be able to
> load my test data properly. How can I populate _id as an ObjectId, not
> just as a string? (I'm assuming there's a difference, but it's never
> come up until now.)
> Am I being too difficult? I haven't been able to find much written about
> this topic: discussions about mocking drown out everything else the
> moment you mention 'mongo' and 'test' in the same search.