
eZ Publish Fetch Functions Optimization

Whenever we need to get a list of articles, users or comments in eZ Publish, we use fetch functions. Fetch functions can be used both in templates (http://doc.ez.no/eZ-Publish/Technical-manual/4.x/Reference/Modules/content/Fetch-functions/list, http://doc.ez.no/eZ-Publish/Technical-manual/4.x/Reference/Modules/content/Fetch-functions/tree) and in PHP code (https://github.com/ezsystems/ezpublish/blob/master/kernel/classes/ezcontentobjecttreenode.php#L1837). Problems may arise, however, when you need to fetch a lot of nodes.
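For example, the content/list template fetch has a direct PHP counterpart. Here is a minimal sketch, assuming the legacy eZFunctionHandler::execute() API and a hypothetical parent node ID of 2:

Source code
// Rough PHP equivalent of the content/list template fetch for the children of node 2
$articles = eZFunctionHandler::execute( 'content', 'list', array(
    'parent_node_id'     => 2,
    'class_filter_type'  => 'include',
    'class_filter_array' => array( 'article' ),
    'limit'              => 10
) );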


Below is a simple example of using eZ Publish fetch functions: a command-line script which fetches all published nodes. Here is the source code:


Source code    
<?php
set_time_limit( 0 );
ini_set( 'memory_limit', '2048M' );

require 'autoload.php';

$cli = eZCLI::instance();
$cli->setUseStyles( true );

$scriptSettings = array();
$scriptSettings['description']    = 'Fetches all nodes and iterates them';
$scriptSettings['use-session']    = true;
$scriptSettings['use-modules']    = true;
$scriptSettings['use-extensions'] = true;
$scriptSettings['site-access']    = 'siteadmin';

$script = eZScript::instance( $scriptSettings );
$script->startup();
$script->initialize();

$cli->output( str_repeat( '-', 64 ) );
$cli->output( 'Starting script...' );
$cli->output( str_repeat( '-', 64 ) );

$startTime = microtime( true );

$nodes = eZContentObjectTreeNode::subTreeByNodeID(
    array(
        'Depth'       => false,
        'Limitation'  => array(),
        'LoadDataMap' => true
    ),
    1
);

$count = count( $nodes );
foreach( $nodes as $key => $node ) {
    $object = $node->attribute( 'object' );
    if( $object instanceof eZContentObject === false ) {
        continue;
    }

    $dataMap = $object->attribute( 'data_map' );

    if( $key % 100 === 0 ) {
        $memoryUsage = number_format( memory_get_usage( true ) / ( 1024 * 1024 ), 2 );
        $output = number_format( $key / $count * 100, 2 ) . '% (' . ( $key + 1 ) . '/' . $count . ')';
        $output .= ', Memory usage: ' . $memoryUsage . ' Mb';
        $cli->output( $output );
    }
}

$executionTime = round( microtime( true ) - $startTime, 2 );

$cli->output( str_repeat( '-', 64 ) );
$cli->output( 'Script took ' . $executionTime . ' secs.' );
$cli->output( str_repeat( '-', 64 ) );

$script->shutdown( 0 );

?>

After running the script we got the following result:

Source code    
$ php extension/nxc_test/bin/php/fetch_many_nodes.php
----------------------------------------------------------------
Starting script...
----------------------------------------------------------------
0.00% (1/1384), Memory usage: 63.50 Mb
7.23% (101/1384), Memory usage: 63.50 Mb
14.45% (201/1384), Memory usage: 63.50 Mb
21.68% (301/1384), Memory usage: 63.50 Mb
28.90% (401/1384), Memory usage: 63.50 Mb
36.13% (501/1384), Memory usage: 63.50 Mb
43.35% (601/1384), Memory usage: 63.50 Mb
50.58% (701/1384), Memory usage: 63.50 Mb
57.80% (801/1384), Memory usage: 63.50 Mb
65.03% (901/1384), Memory usage: 63.50 Mb
72.25% (1001/1384), Memory usage: 63.50 Mb
79.48% (1101/1384), Memory usage: 63.50 Mb
86.71% (1201/1384), Memory usage: 63.50 Mb
93.93% (1301/1384), Memory usage: 63.50 Mb
----------------------------------------------------------------
Script took 2.95 secs.
----------------------------------------------------------------

If your eZ Publish installation contains tens of thousands of published nodes, this script will most likely fail, because all nodes are fetched together with their data maps. The resulting SQL query becomes too large and complicated, and your MySQL server will most likely be overloaded while it is being executed. So think carefully before using the LoadDataMap option in eZ Publish fetch functions without a specified limit.
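If you really do need the data maps, you can at least keep each query manageable by limiting the fetch. A minimal sketch, assuming the standard Limit/Offset fetch parameters (a full chunked loop built on the same idea is discussed in the comments below):

Source code
// Sketch: with LoadDataMap enabled, cap how many nodes a single fetch pulls in
$nodes = eZContentObjectTreeNode::subTreeByNodeID(
    array(
        'Depth'       => false,
        'Limitation'  => array(),
        'LoadDataMap' => true,
        'Limit'       => 100,
        'Offset'      => 0
    ),
    1
);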

Let’s try to run this script with the LoadDataMap option disabled:

Source code    
$nodes = eZContentObjectTreeNode::subTreeByNodeID(
    array(
        'Depth'       => false,
        'Limitation'  => array(),
        'LoadDataMap' => false
    ),
    1
);

 

Results:

Source code    
$ php extension/nxc_test/bin/php/fetch_many_nodes_as_object.php
----------------------------------------------------------------
Starting script...
----------------------------------------------------------------
0.00% (1/1384), Memory usage: 22.50 Mb
7.23% (101/1384), Memory usage: 22.50 Mb
14.45% (201/1384), Memory usage: 23.75 Mb
21.68% (301/1384), Memory usage: 26.75 Mb
28.90% (401/1384), Memory usage: 29.00 Mb
36.13% (501/1384), Memory usage: 31.50 Mb
43.35% (601/1384), Memory usage: 33.75 Mb
50.58% (701/1384), Memory usage: 36.00 Mb
57.80% (801/1384), Memory usage: 38.25 Mb
65.03% (901/1384), Memory usage: 40.75 Mb
72.25% (1001/1384), Memory usage: 43.25 Mb
79.48% (1101/1384), Memory usage: 45.75 Mb
86.71% (1201/1384), Memory usage: 48.25 Mb
93.93% (1301/1384), Memory usage: 51.25 Mb
----------------------------------------------------------------
Script took 2.16 secs.
----------------------------------------------------------------

In this case it is not difficult to see that memory usage increases with each iteration. Again, if your eZ Publish installation contains a lot of published nodes, this script will probably fail because it runs out of memory. To avoid this kind of “memory leak” you need to clear the cache and reset the objects’ data maps at the end of each iteration. This can be done with the following code:

Source code    
eZContentObject::clearCache( $object->attribute( 'id' ) );
$object->resetDataMap();

Each time an eZ Publish object is fetched, its data is stored in the eZContentObjectContentObjectCache static variable (https://github.com/ezsystems/ezpublish/blob/master/kernel/classes/ezcontentobject.php#L833). eZContentObject::clearCache() clears this variable.
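To illustrate the effect of this cache, here is a minimal sketch (assuming $objectID holds the ID of an existing content object):

Source code
$first  = eZContentObject::fetch( $objectID );
$second = eZContentObject::fetch( $objectID );
// Same instance, served from eZContentObjectContentObjectCache
var_dump( $first === $second ); // bool(true)

eZContentObject::clearCache( $objectID );
$third = eZContentObject::fetch( $objectID ); // re-loaded from the database
var_dump( $first === $third ); // bool(false)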

Let’s clear cache and reset object data map at the end of each iteration:

Source code    
foreach( $nodes as $key => $node ) {
    $object = $node->attribute( 'object' );
    if( $object instanceof eZContentObject === false ) {
        continue;
    }

    $dataMap = $object->attribute( 'data_map' );

    eZContentObject::clearCache( $object->attribute( 'id' ) );
    $object->resetDataMap();

    if( $key % 100 === 0 ) {
        $memoryUsage = number_format( memory_get_usage( true ) / ( 1024 * 1024 ), 2 );
        $output = number_format( $key / $count * 100, 2 ) . '% (' . ( $key + 1 ) . '/' . $count . ')';
        $output .= ', Memory usage: ' . $memoryUsage . ' Mb';
        $cli->output( $output );
    }
}

And run the script:

Source code    
$ php extension/nxc_test/bin/php/fetch_many_nodes_as_object_clear_memory.php
----------------------------------------------------------------
Starting script...
----------------------------------------------------------------
0.00% (1/1384), Memory usage: 22.50 Mb
7.23% (101/1384), Memory usage: 22.50 Mb
14.45% (201/1384), Memory usage: 22.50 Mb
21.68% (301/1384), Memory usage: 22.50 Mb
28.90% (401/1384), Memory usage: 22.50 Mb
36.13% (501/1384), Memory usage: 22.50 Mb
43.35% (601/1384), Memory usage: 22.50 Mb
50.58% (701/1384), Memory usage: 22.50 Mb
57.80% (801/1384), Memory usage: 22.50 Mb
65.03% (901/1384), Memory usage: 22.50 Mb
72.25% (1001/1384), Memory usage: 22.50 Mb
79.48% (1101/1384), Memory usage: 22.50 Mb
86.71% (1201/1384), Memory usage: 22.50 Mb
93.93% (1301/1384), Memory usage: 22.50 Mb
----------------------------------------------------------------
Script took 2.46 secs.
----------------------------------------------------------------

It’s much better than last time, but we can still use even less memory. The $nodes array contains eZContentObjectTreeNode objects. If it contained plain arrays instead, much less memory would be needed to store it. The AsObject option allows us to achieve this:

Source code    
$nodes = eZContentObjectTreeNode::subTreeByNodeID(
    array(
        'Depth'       => false,
        'Limitation'  => array(),
        'LoadDataMap' => false,
        'AsObject'    => false
    ),
    1
);

$count = count( $nodes );
foreach( $nodes as $key => $node ) {
    $object = eZContentObject::fetch( $node['contentobject_id'] );
    if( $object instanceof eZContentObject === false ) {
        continue;
    }

    $dataMap = $object->attribute( 'data_map' );

    eZContentObject::clearCache( $object->attribute( 'id' ) );
    $object->resetDataMap();

    if( $key % 100 === 0 ) {
        $memoryUsage = number_format( memory_get_usage( true ) / ( 1024 * 1024 ), 2 );
        $output = number_format( $key / $count * 100, 2 ) . '% (' . ( $key + 1 ) . '/' . $count . ')';
        $output .= ', Memory usage: ' . $memoryUsage . ' Mb';
        $cli->output( $output );
    }
}

Results:

Source code    
$ php extension/nxc_test/bin/php/fetch_many_nodes_as_array_clear_memory.php
----------------------------------------------------------------
Starting script...
----------------------------------------------------------------
0.00% (1/1384), Memory usage: 16.50 Mb
7.23% (101/1384), Memory usage: 16.75 Mb
14.45% (201/1384), Memory usage: 16.75 Mb
21.68% (301/1384), Memory usage: 16.75 Mb
28.90% (401/1384), Memory usage: 16.75 Mb
36.13% (501/1384), Memory usage: 16.75 Mb
43.35% (601/1384), Memory usage: 16.75 Mb
50.58% (701/1384), Memory usage: 16.75 Mb
57.80% (801/1384), Memory usage: 16.75 Mb
65.03% (901/1384), Memory usage: 16.75 Mb
72.25% (1001/1384), Memory usage: 16.75 Mb
79.48% (1101/1384), Memory usage: 16.75 Mb
86.71% (1201/1384), Memory usage: 16.75 Mb
93.93% (1301/1384), Memory usage: 16.75 Mb
----------------------------------------------------------------
Script took 2.3 secs.
----------------------------------------------------------------

In this case the memory savings are modest, but the more nodes you fetch, the bigger the difference becomes.

I hope this post was useful to you. Perhaps you can suggest some other ways to optimize performance in the comments?


6 Responses to "eZ Publish Fetch Functions Optimization"

  1. Nat   on Thursday, May 24

    Hi there!
    Serhey, do you recommend this optimization for a self-created module or for existing modules (e.g. via overrides)?

    Anyway, the idea of clearing caches is good. Thanks!

  2. Serhey Dolgushev   on Friday, May 25

    Hello,
    You should follow the described recommendations for heavy fetches in your extensions.
    Unfortunately there is no built-in template operator to clear an object’s data map, so you would have to add your own for this purpose.
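    A minimal sketch of such an operator (hypothetical class and operator names, following the legacy template operator API; it would also need to be registered in the extension’s eztemplateautoload.php):

    class ClearObjectDataMapOperator
    {
        function operatorList()
        {
            return array( 'clear_object_data_map' );
        }

        function namedParameterPerOperator()
        {
            return true;
        }

        function namedParameterList()
        {
            return array(
                'clear_object_data_map' => array(
                    'object' => array( 'type' => 'object', 'required' => true )
                )
            );
        }

        function modify( $tpl, $operatorName, $operatorParameters, $rootNamespace, $currentNamespace, &$operatorValue, $namedParameters )
        {
            $object = $namedParameters['object'];
            if( $object instanceof eZContentObject ) {
                // Clear the static cache entry and drop the data map for this object
                eZContentObject::clearCache( $object->attribute( 'id' ) );
                $object->resetDataMap();
            }
            $operatorValue = true;
        }
    }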

  3. Frode Austvik   on Wednesday, August 15

    Hi,

    Even your last version will probably fail on an installation with a lot of published nodes, because you still haven’t addressed the biggest problem with the code: it fetches the entire list of nodes into a single big array, before looping through any of them.

    This might be the conceptually easiest way to loop through all the nodes, but it doesn’t take much to improve this so that it doesn’t fail even on sites with a huge number of nodes.

    The basic idea is the same as the pagination of e.g. search results, namely using a limit and offset to fetch parts of the result set, processing each chunk separately. The subTree method provides the Limit and Offset parameters for just this purpose.

    That way, you only ever fetch a few objects at a time, without having to keep the entire list in an array, which will allow you to keep memory usage fairly low and constant even with AsObject=true.

    In fact, my limited testing (see below) suggests that keeping LoadDataMap=true works out quite well with this approach.

    Pseudocode (some error checking should probably be added):
    $params = array( … , 'Limit' => 100, 'Offset' => 0 );
    $list = subTree( $params );
    while ( !empty( $list ) ) {
        foreach ( $list as $node ) { … do whatever with the node … }
        clearCache()
        $params['Offset'] += $params['Limit'];
        $list = subTree( $params );
    }

    This skips the resetDataMap() call because the entire list of objects, including the data maps, will be garbage collected on the next fetch (since $list is reused) – and calls clearCache() outside the inner loop, without parameters, so it simply clears the entire cache instead of each object individually.

    Depending on what exactly you’re fetching, you may need to include a defined ordering of the returned nodes – if the order doesn’t otherwise matter, I’d suggest using the primary key, since there’s always an index on that field in the database, and in ascending order, since that should let you avoid repetitions/missed nodes even when things are published/etc. concurrently.

    To provide some benchmark numbers:

    For comparison, I took your original script and ran it on a moderately large site (12371 nodes), and it took so long before it even started printing memory usage that I gave up and aborted it. With the LoadDataMap=false change, it started out using 119.75MB, and ended up at 285.00MB after 49.27 secs. With cache clearing and AsObject=false, it used 66.00MB throughout, and 48.03 secs.

    Changing it to use my approach, with LoadDataMap=false, AsObject=true and Limit=1000, the memory moved up and down between 34.5MB and 38.00MB (probably due to the varying sizes of the content), and used 59.87 secs. Somewhat slower, but smaller.

    With Limit=100, it only used 10.50MB to 13.50MB, but was a lot slower – 218.54 secs – by far most of which was spent waiting for the database. With Limit=2000, it used 62.00MB to 69.25MB, and 49.15 secs.

    More interestingly, with Limit=1000 and LoadDataMap=true, it used 53.50MB to 57.00MB and 39.8 secs, which is both faster and smaller.

    And, unlike your versions, mine should use about the same amount of memory no matter how many nodes there are in the database, which should make it easier to tune for a specific site.

  4. Serhey Dolgushev   on Tuesday, October 2

    Hi Frode,
    You are right, it is possible to fetch a portion of the nodes on each iteration. But there will be memory leaks on each iteration if you do not use the following code:

    eZContentObject::clearCache( $object->attribute( 'id' ) );
    $object->resetDataMap();

    Fetching nodes in portions is just another possible solution, but its source code is more complicated – could you please compare your test code against the last code block in this post?

    Here are some calculations. PHP requires about 5 KB to store each node as an array in memory. I checked it using the following code:

    $start_memory = memory_get_usage();
    $new = unserialize( serialize( $node ) );
    $cli->output( 'Memory usage for one node: ' . number_format( ( memory_get_usage() - $start_memory ) / 1024, 2 ) . ' Kb' );
    unset( $new );

    In this case 100000 * 5 / 1024 = 488 MB of memory will be enough to fetch 100k eZ nodes, and the source code stays simple, understandable and linear, without handling any additional iterations.

    Interesting facts:
    – PHP requires about 35 KB (on average) to store each node as an object with its data map in memory.
    – PHP requires about 8 KB (on average) to store each node as an object without its data map in memory.
    – PHP requires about 5 KB (on average) to store each node as an array in memory.

    • Frode Austvik   on Tuesday, October 2

      Hmm, I think I see what you mean – I wasn’t really aware of PHP having problems with circular references (fixed as of 5.3 apparently), nor that there were such references in play here.

      However, that only means $object->resetDataMap() should be called for each object, and doesn’t require the other method to be called with the object’s ID like you’re doing – one of my points was that I still include that call, but put it outside the inner loop and without the object ID parameter, as that’s faster and works just as well (by clearing the entire cache instead of removing each reference individually).

      As for 488MiB being enough for 100k eZP nodes, that’s not quite true, due to other overheads (I took the subtree from your last code block, and put memory_get_usage around it – ended up with >5.86 KiB per node for 2401 nodes, which makes it 572 MiB for 100k *if* it’s linear, which it might not be) – and even if it’s true, that’s still quite a lot of memory that could be put to better use (such as serving requests or caching often-used data). And you still need the memory for each node object while processing it.

      By the way – your averages, in particular the one with the data map, are very data-dependent. As an example, for the site I ran the tests on for this comment (not the previous one), fetching the first 500 nodes with data maps gave me an average of 130.56 KiB per node.

      Frankly – the code is not much more complicated, especially if you assume just a modicum of skill in the programmer. I didn’t want to clutter the page with another copy of the code, especially since I’m not sure how to get it formatted properly, but I guess I can try:

      $params = array(
          'Depth'       => false,
          'Limitation'  => array(),
          'LoadDataMap' => true,
          'AsObject'    => true,
          'Limit'       => 500,
          'Offset'      => 0,
      );
      $count = eZContentObjectTreeNode::subTreeCountByNodeID( $params, 1 );

      $nodes = eZContentObjectTreeNode::subTreeByNodeID( $params, 1 );
      $key = 0;
      while ( is_array( $nodes ) && !empty( $nodes ) ) {

          foreach( $nodes as $node )
          {
              $key++;
              $object = $node->attribute( 'object' );
              if( $object instanceof eZContentObject === false ) {
                  continue;
              }

              $dataMap = $object->attribute( 'data_map' );
              $object->resetDataMap();
          }

          eZContentObject::clearCache();

          $params['Offset'] += $params['Limit'];
          $nodes = eZContentObjectTreeNode::subTreeByNodeID( $params, 1 );

          $memoryUsage = number_format( memory_get_usage( true ) / ( 1024 * 1024 ), 2 );
          $output = number_format( $key / $count * 100, 2 ) . '% (' . ( $key + 1 ) . '/' . $count . ')';
          $output .= ', Memory usage: ' . $memoryUsage . ' Mb';
          $cli->output( $output );
      }

  5. Vladn   on Tuesday, June 18

    How long can one take the drug Асентр?


