Realtime Cloud Storage
Query and Scan

  • Feb 27, 2015
  • Starting Guide

When you search for data in a Realtime Cloud Storage table the SDK will use one of two available operations: Query or Scan.

Query

A Query operation searches only key schema attribute(s) values using indexes and supports a subset of comparison operators on key schema attribute values to refine the search process. A query returns all of the item data for the matching primary keys (all of each item's attributes) up to 1 MB of data per query operation.

A Query operation always returns results, but can return empty results. Query results are always sorted by the secondary key. If the data type of the secondary key is Number, the results are returned in numeric order; otherwise, the results are returned in order of ASCII character code values. By default, the sort order is ascending. To reverse the order use the desc() function of the TableRef class (see http://storage-public.realtime.co/documentation/javascript/1.0.0/realtime.storage.tableref.html)

Scan

A Scan operation examines every item in the table. You can specify filters to apply to the results to refine the values returned to you, after the scan has finished. Realtime Cloud Storage puts a 1 MB limit on the scan (the limit applies before the results are filtered). A Scan can result in no table data meeting the filter criteria.

Generally, a Query operation is more efficient than a Scan operation, since as a table grows, the Scan operation slows. Also as the Scan operation examines every item for the requested values, it can use up the provisioned throughput for a large table in a single operation. For quicker response times, design your tables in a way that you can use Query operations, that is, always try to search items based on their primary key, especially at the hot parts of your application (the ones with more requests).

An example

If you have a ProductCatalog table in which each item is a product with an attribute describing the Category it belongs to, if you need to filter the products belonging to the Sports category you would do the following search using the JavaScript SDK (it would result in a scan operation being performed):

filter = {item: "Category", value: “Sports” };
tableRef = storageRef.table(“ProductCatalog”).equals(filter);
tableRef.getItems(
  function(item) { 
    if (item) 
      console.log(item.val()); 
  },
  function(error) { 
    console.log(“Error searching items:”, error);
  }       
);

It would work fine, but if the ProductCatalog table had millions of items this scan operation could become slow and if this search was often requested by the application users the throughput provision would have to be increased in order to have high-performance results.

Creating a table ProductsByCategory with Category as the primary key, ProductID as the secondary key along with the remaining product attributes (e.g. price, color, …), you could do the following search to obtain the products of the Sports category as a Query operation instead of a Scan operation:

filter = {item: "Category", value: “Sports” };
tableRef = storageRef.table(“ProductsByCategory”).equals(filter);
tableRef.getItems(
  function(item) { 
    if (item) 
      console.log(item.val()) 
  },
  function(error) { 
    console.log(“Error searching items:”, error); 
  }       
);

Because you are now searching using the table primary key (Category), the Realtime Cloud Storage will be able to perform a Query operation without scanning the entire table, rendering a predictable search performance whether the table has 1.000 items or 100 million items.

A note about denormalization

Obviously with this table strategy when you add a new product you’ll need to add an item with the products attributes in both tables, ProductCatalog and ProductsByCategory. The same would happen when you update a product, but in order to have a fast and scalable application you sometimes need to denormalize your data, duplicating it. Duplicating data like this can be counter-intuitive, especially for a relational database developer, but in order to build truly web scalable applications, denormalization is almost a requirement. In fact you are optimizing your data reads by writing extra data at write-time. In this example product inserts and updates are far less frequent than searches made by users. Disk space is cheap these days, but a user's time is worth gold.

Back to Applying filters or proceed to Real-time Notifications

If you find this interesting please share: