The code in this repository demonstrates how to efficiently extract large data sets from a MySQL database and export them to an Excel file. It turns out that this is not as easy as it seems. Both of these tasks often lead to terrible performance and out-of-memory crashes with big data sets.
This example uses PDO as the MySQL driver as well as Spout to create XLSX files. [Read why.]
- Clone or download this repository.
- Add data to your MySQL database. This example provides test data in `sql/fixtures.sql`. To create a new database and import the test data into it, run:

  ```
  mysql -u {USERNAME} -p {PASSWORD} < sql/fixtures.sql
  ```

  You may need to comment out the first line of the SQL file for the first import.
- Run the example:

  ```
  php index.php
  ```
These metrics have been obtained by fetching and writing 50,000 rows (200,000 cells) to an XLSX file.
Row fetching mode | Elapsed time | Memory peak |
---|---|---|
One by one | 8.32s | 1MB |
Batch (500) | 8.02s | 2.25MB |
Batch (1000) | 9.72s | 3.75MB |
All at once | 9.24s | 73.25MB |
These metrics have been obtained by fetching and writing 200,000 rows (800,000 cells) to an XLSX file.
Row fetching mode | Elapsed time | Memory peak |
---|---|---|
One by one | 32.95s | 1MB |
Batch (500) | 36.86s | 2.25MB |
Batch (1000) | 38.70s | 3.75MB |
All at once | 40.61s | 289MB |
Even though PHPExcel is a great and popular library, it reaches its limits with large data sets. Spout was designed to support data sets of any size, making it easy to scale without needing to worry about caching, optimizations, etc.
To support the previous claim, the same set of metrics was collected, but using PHPExcel instead of Spout. Here are the results obtained:
Num written rows | Metric | PHPExcel | Spout | Ratio |
---|---|---|---|---|
50,000 | Elapsed time | 33.56s | 8.02s | 4.2x |
50,000 | Memory peak | 186MB | 2.25MB | 83x |
200,000 | Elapsed time | 172.91s | 36.86s | 4.5x |
200,000 | Memory peak | 720.5MB | 2.25MB | 320x |
In this example, several techniques are used to keep memory usage low without compromising performance.
The single row fetching mode uses unbuffered queries. Contrary to buffered queries where the whole result set is immediately sent to the client and stored in memory, the results of unbuffered queries are stored on the MySQL server and can be fetched one row at a time. This greatly reduces the amount of memory needed by the PHP script since only one row is stored in memory at any given time.
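As an illustration, here is a minimal sketch of one-by-one fetching with an unbuffered PDO query. The connection parameters, table name, and `processRow()` callback are placeholders, not code from this repository:

```php
<?php
// Placeholder connection details.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');

// Turn off buffering: the result set stays on the MySQL server,
// and rows are streamed to PHP one at a time as fetch() is called.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$statement = $pdo->query('SELECT * FROM large_table');
while ($row = $statement->fetch(PDO::FETCH_ASSOC)) {
    // Only the current row lives in PHP's memory at this point.
    processRow($row);
}
```

One caveat of unbuffered queries: a connection can only have one unbuffered statement in flight at a time, so the result set must be fully fetched (or the statement closed) before issuing another query on the same connection.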
The batch fetching mode uses the primary key to split the result set into small chunks. Most importantly, it does not use `OFFSET` to fetch the next batches; instead, it remembers the ID of the last fetched row and uses it as the starting point (`WHERE id > $lastFetchedId`). This works great for very large tables and avoids costly full table scans. Combined with `LIMIT`, only a few rows are stored in memory at any given time.
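The batching technique described above can be sketched as follows. Again, the connection details, table name, and `processRow()` are illustrative placeholders:

```php
<?php
// Placeholder connection details.
$pdo = new PDO('mysql:host=localhost;dbname=test', 'user', 'password');

$batchSize = 500;
$lastFetchedId = 0;

// Keyset pagination: filter on the primary key instead of using OFFSET,
// so MySQL never has to scan and discard already-read rows.
$statement = $pdo->prepare(
    'SELECT * FROM large_table WHERE id > :lastId ORDER BY id ASC LIMIT ' . $batchSize
);

do {
    $statement->execute(['lastId' => $lastFetchedId]);
    $rows = $statement->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        processRow($row);
        // The last seen ID becomes the lower bound of the next batch.
        $lastFetchedId = $row['id'];
    }
} while (count($rows) === $batchSize);
```

The loop stops when a batch comes back smaller than `$batchSize`, which means the end of the table was reached.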
Finally, Spout uses streams to write the data it is being passed. Streaming means that Spout only acts as a passthrough: once the data is written, it simply forgets about it, freeing the memory.
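For reference, writing rows with Spout looks roughly like this sketch (assuming the Spout 3.x `WriterEntityFactory` API; `$rowsSource` is a placeholder for whatever iterable yields the fetched rows):

```php
<?php
require 'vendor/autoload.php';

use Box\Spout\Writer\Common\Creator\WriterEntityFactory;

// Create a streaming XLSX writer.
$writer = WriterEntityFactory::createXLSXWriter();
$writer->openToFile('export.xlsx');

// Each row is flushed to disk as it is added and then discarded,
// so memory usage stays flat regardless of the number of rows.
foreach ($rowsSource as $values) {
    $writer->addRow(WriterEntityFactory::createRowFromArray($values));
}

$writer->close();
```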
The data set is generated by a script: `scripts/gen-fixtures.php`. It defines the number of rows to generate (50,000 by default). You can update this number and run `php scripts/gen-fixtures.php` to regenerate the `sql/fixtures.sql` file.