Select.GroupBy: What’s New and What’s to Come

Announcing Synergy/DE 12.2.1.1003 Feature Release for OpenVMS

December 17, 2022

YAML Your Way to Rapid Deployment

December 19, 2022

Select.GroupBy: What’s New and What’s to Come

Published by Tate Chamberlain on December 19, 2022

Tags

tech article

Overview

The Select class in Synergy DBL allows developers to access data from Synergy DBMS files using syntax and concepts that are closer to SQL than to the traditional file-based approach. Since its introduction, the Select class has grown to include more SQL-like features. With Synergy/DE 12, we’ve taken another step in that direction by introducing a new class that can be used with Select: GroupBy. This feature was initially introduced in Synergy/DE 12.1 but has been refined in the recent 12.2.1.1003 feature release.

The GroupBy class implements some of the functionality found in the GROUP BY and DISTINCT keywords in SQL. GroupBy enables you to filter results from a Select query so that similar records are treated as groups instead of individual records. It does this by letting you specify one or more fields, which it will use to sort the result set. Any set of records with the same value in those fields will be treated as a single group, and only one record from the group will be returned as a representative of the whole.

Basic Syntax The Synergy GroupBy class is derived from the OrderBy class, but the syntax is a bit different. Instead of using the Ascending and Descending methods, you can use a constructor and specify one or more fields to group by, similar to the Sparse class. For example

new GroupBy(firstname)
new GroupBy(lastname, firstname)

If you want to specify multiple fields, you can combine multiple GroupBy objects with the .AND. or && operators or just list multiple fields in the same call to Ascending() or Descending(). You can even combine both forms. For example:

grpobj1 = new GroupBy(firstname,lastname)
grpobj2 = new GroupBy(middleinitial)
grpobj3 = grpobj1.and.grpobj2
grpobj4 = new GroupBy(firstname,lastname) && (new GroupBy(middleinitial))

In any case, the leftmost field has the highest priority, and the rightmost field has the lowest priority. For grpobj3 and grpobj4 in the example above, records would be sorted first by firstname, then by lastname, and finally by middleinitial.

Ordering is an inherent concept in GroupBy: the returned groups will always be sorted by the specified fields in ascending order.

The constructor generates a GroupBy object, which can either be used inline in a Select constructor or assigned to a variable for later use. For instance:

selobj = new select(sprsobj, fromobj, new GroupBy(firstname,lastname))

         grpobj = new GroupBy(firstname,lastname)
selobj = new select(sprsobj, fromobj, grpobj)

Whenever you use a GroupBy object in a Select query, you must also use a Sparse object. The reasons for this are described in the Sparse section below. For now, the examples will use “Sparse(“*”)”, which selects all fields. For example:

selobj = new select(new Sparse("*"), fromobj, new GroupBy(lastname))

Basic Example

To see this in action, consider the following example:

structure employee
    key,           d5
    ,              a1
    firstname,     a15
    ,              a1
    lastname,      a15
    ,              a1
    department,    a15
    ,              a1
    salary,        d7
endstructure

import Synergex.SynergyDE.Select
.main
record
    sprsobj, @Sparse
    fromobj, @From
    selobj,  @Select
    rec1,            employee
record
    count,         i4
    salarysum,     d8
    salaryavg,     d7

.define FILE "DAT:employees.ism"
.proc
    xcall flags(7000000,1)
    open(1, o, "TT:")
    xcall isamc(FILE, ^size(rec1), 1,
&               "START=1, LENGTH=5, NAME=""Key1"", DUPS, ASCEND, ATEND")
        open(2, u:I, FILE)
        store(2, "00001 Andrew          McDonald        Development     0025010")
        store(2, "00002 Bob             Brown           HR              0030020")
        store(2, "00003 Beth            Jones           HR              0035030")
        store(2, "00004 Cathy           Brown           Sales           0040040")
        store(2, "00005 David           SMITH           HR              0045050")
        store(2, "00006 Fred            Smith           QA              0050060")
        store(2, "00007 George          Cartwright      Shipping        0055070")
        store(2, "00008 Henry           Smith           Development     0060080")
        store(2, "00009 Jennifer        Jones           QA              0065090")
        store(2, "00010 James           Mcdonald        Development     0070000")
        store(2, "00011 John            Cartwright      Development     0075010")
        store(2, "00012 Laura           Jones           QA              0080020")
        store(2, "00013 Matthew         Cartwright      Shipping        0085030")
        store(2, "00014 Phillip         Brown           Sales           0090040")
        store(2, "00015 Rose            Jones           Sales           0095050")
        close 2
        open(2, I:I, FILE)
        fromobj = new from(2, rec1)
        writes(1, "")
        writes(1, "OrderBy Department:")
        writes(1, "KEY   FIRSTNAME       LASTNAME        DEPARTMENT      SALARY")
        selobj = new select(fromobj, OrderBy.Ascending(rec1.department))
        foreach rec1 in selobj
        begin
          writes(1,rec1)
        end
        writes(1, "")
        writes(1, "GroupBy Department:")
        writes(1, "KEY   FIRSTNAME       LASTNAME        DEPARTMENT      SALARY")
        sprsobj = new Sparse("*")
        selobj = new select(sprsobj, fromobj, new GroupBy(rec1.department))
        foreach rec1 in selobj
        begin
          writes(1,rec1)
        end
        writes(1, "")
        close 2
        close 1
        stop
.end

This program creates an ISAM file containing an example table of employees. It runs a Select query with OrderBy to retrieve the full table sorted by department:

OrderBy Department:
KEY   FIRSTNAME       LASTNAME        DEPARTMENT      SALARY
00001 Andrew          McDonald        Development     0025010
00008 Henry           Smith           Development     0060080
00010 James           McDonald        Development     0070000
00011 John            Cartwright      Development     0075010
00002 Bob             Brown           HR              0030020
00003 Beth            Jones           HR              0035030
00005 David           Smith           HR              0045050
00006 Fred            Smith           QA              0050060
00009 Jennifer        Jones           QA              0065090
00012 Laura           Jones           QA              0080020
00004 Cathy           Brown           Sales           0040040
00014 Phillip         Brown           Sales           0090040
00015 Rose            Jones           Sales           0095050
00007 George          Cartwright      Shipping        0055070
00013 Matthew         Cartwright      Shipping        0085030

It then runs the same query again, using GroupBy instead of OrderBy:

GroupBy Department:
KEY   FIRSTNAME       LASTNAME        DEPARTMENT      SALARY
00001 Andrew          McDonald        Development     0025010
00002 Bob             Brown           HR              0030020
00006 Fred            Smith           QA              0050060
00004 Cathy           Brown           Sales           0040040
00007 George          Cartwright      Shipping        0055070

In the second query, all of the records with a certain value for department are treated as a single group, and only one record is returned for each group.

Suppose we changed the OrderBy and GroupBy objects to use both rec1.department and rec1.lastname fields instead of just rec1.department:

selobj = new select(fromobj, OrderBy.Ascending(rec1.department, rec1.lastname))
selobj = new select(sprsobj, fromobj, new GroupBy(rec1.department, rec1.lastname))

The OrderBy query would return the same results as before, but in a slightly different order:

OrderBy Department and Lastname:
KEY   FIRSTNAME       LASTNAME        DEPARTMENT      SALARY
00011 John            Cartwright      Development     0075010
00001 Andrew          McDonald        Development     0025010
00010 James           Mcdonald        Development     0070000
00008 Henry           Smith           Development     0060080
00002 Bob             Brown           HR              0030020
00003 Beth            Jones           HR              0035030
00005 David           SMITH           HR              0045050
00009 Jennifer        Jones           QA              0065090
00012 Laura           Jones           QA              0080020
00006 Fred            Smith           QA              0050060
00004 Cathy           Brown           Sales           0040040
00014 Phillip         Brown           Sales           0090040
00015 Rose            Jones           Sales           0095050
00007 George          Cartwright      Shipping        0055070
00013 Matthew         Cartwright      Shipping        0085030

But the GroupBy query would return additional records that weren’t present in the original:

GroupBy Department and Lastname:
KEY   FIRSTNAME       LASTNAME        DEPARTMENT      SALARY
00011 John            Cartwright      Development     0075010
00001 Andrew          McDonald        Development     0025010
00010 James           Mcdonald        Development     0070000
00008 Henry           Smith           Development     0060080
00002 Bob             Brown           HR              0030020
00003 Beth            Jones           HR              0035030
00005 David           SMITH           HR              0045050
00009 Jennifer        Jones           QA              0065090
00006 Fred            Smith           QA              0050060
00004 Cathy           Brown           Sales           0040040
00015 Rose            Jones           Sales           0095050
00007 George          Cartwright      Shipping        0055070

This is because the GroupBy query is now using both fields to determine groups. Records in the new query will only be considered as part of the same group if they have identical values for both department and lastname, instead of just department. If multiple employee records in the same department have the same lastname value, they’ll still be part of the same group, but different last names will result in separate groups.

For the sake of simplicity, the examples here select all records from the data file. But GroupBy will also work with more complicated queries that include a Where expression.

Case Sensitivity

You may have noticed from the previous example that there appear to be two groups with a department value of “Development” and a lastname value of “McDonald”. This is because the default behavior for OrderBy and GroupBy is to use case-sensitive comparisons when sorting records. So “McDonald” and “Mcdonald” are distinct values and result in separate groups in GroupBy. This is easy to see if we group by lastname only:

GroupBy Lastname:
KEY   FIRSTNAME       LASTNAME        DEPARTMENT      SALARY
00002 Bob             Brown           HR              0030020
00007 George          Cartwright      Shipping        0055070
00003 Beth            Jones           HR              0035030
00001 Andrew          McDonald        Development     0025010
00010 James           Mcdonald        Development     0070000
00005 David           SMITH           HR              0045050
00006 Fred            Smith           QA              0050060

If you want to ignore differences in case, you can use the NoCaseGroupBy class instead of the regular GroupBy:

NoCaseGroupBy Lastname:
KEY   FIRSTNAME       LASTNAME        DEPARTMENT      SALARY\
00002 Bob             Brown           HR              0030020
00007 George          Cartwright      Shipping        0055070
00003 Beth            Jones           HR              0035030
00001 Andrew          McDonald        Development     0025010
00005 David           SMITH           HR              0045050

We’ve also added the same functionality to OrderBy in the new NoCaseOrderBy class. Note that NoCaseGroupBy and NoCaseOrderBy are only intended to be used with alpha fields and may cause problems if used with any other data type. If you want to use case-insensitive comparisons for some fields and not others, you can create multiple objects and combine them. For example:

new NoCaseGroupBy(alpha_field) && (new GroupBy(int_field))

Sparse

A Select query using GroupBy will return a record from the data file to represent each group as a whole. Because this is a regular record, we can expect it to contain data in all of its fields, not just the fields used to determine the group. But these fields will probably be meaningless for use in representing the group. We can see this by going back to the “GroupBy(rec1.department)” example:

GroupBy Department:
KEY   FIRSTNAME       LASTNAME        DEPARTMENT      SALARY
00001 Andrew          McDonald        Development     0025010
00002 Bob             Brown           HR              0030020
00006 Fred            Smith           QA              0050060
00004 Cathy           Brown           Sales           0040040
00007 George          Cartwright      Shipping        0055070

The only meaningful field in that data is the one used to determine the groups: department. The fact that the employee record returned for the Development department has a name of “Andrew McDonald” doesn’t necessarily tell us anything about the Development department. And the fact that the records for the HR and Sales departments both have a lastname value of “Brown” doesn’t necessarily indicate any similarity between those departments.

It would be helpful to limit the returned data to fields that actually matter. In a SQL query, this could be done by writing a query to only return the fields you care about. In Synergy’s Select class, the fields used by the returned data are determined in advance by the record definition. But you can use the Sparse class to specify the fields that you want returned, leaving all other fields empty. For example:

selobj = new select(new Sparse(rec1.department), fromobj,
&                   new GroupBy (rec1.department))

which results in

GroupBy Department (Sparse):
KEY   FIRSTNAME       LASTNAME        DEPARTMENT      SALARY
                                      Development
                                      HR
                                      QA
                                      Sales
                                      Shipping

The primary use case for Sparse is to improve performance when accessing data over xfServer by reducing the amount of data sent over the network. Our documentation warns against using Sparse when you aren’t also using xfServer, as it may not improve performance. But in the case of GroupBy, Sparse can still be useful regardless of any performance impact.

Additionally, because GroupBy now uses a static cursor to sort and return results, Sparse may improve performance even for local files. A static cursor reads all matching records from a data file into memory before processing (e.g., sorting and grouping) and returning them. Using Sparse to reduce the number of returned fields will reduce the amount of data that needs to be kept in memory, which can reduce memory usage and improve performance.

For these reasons, a Sparse object is required for all queries that use a static cursor, such as all GroupBy queries. We strongly recommend using Sparse to limit a GroupBy query to fields that are actually meaningful, but it is still possible to include additional fields, or even to specify “Sparse(“*”)” and include all fields regardless of relevance.

Aggregate Values

While the non-group fields of any given GroupBy record may not be meaningful individually, it may be helpful to use these fields to describe a group as a whole. For instance, it may not be useful to see the salary of the employee record returned for the QA department, because the record might not be representative of the department. But if you could see the average salary for everyone in the department, or the total amount of money the department spends on payroll, that information could be useful.

This brings us to a primary use case of GROUP BY in SQL: aggregate values. In addition to the actual fields from a table, SQL allows you to select aggregate functions to be performed for all fields in a group, such as Sum, Count, Average, Minimum, and Maximum.

At present, the Synergy GroupBy class doesn’t have aggregate functions built in. But it’s possible to calculate aggregate values in code using multiple Select objects. The first Select object can use GroupBy to retrieve distinct values for specific fields. Then a second Select object can use those values to drive a Where expression and iterate over all records in that group. The code can use those records to calculate any aggregate values.

Advanced Example

To see an example of aggregate values, you can add this code after the existing Select objects in the test program:

writes(1, "GroupBy Department with aggregate values:")
            writes(1, "DEPARTMENT         COUNT     AVG(SALARY)  SUM(SALARY)")
            selobj = new select(new Sparse(rec1.department), fromobj,
&                               new GroupBy(rec1.department))
            foreach rec1 in selobj
            begin
                data selobj2, @Select
                data rec2, employee
                open(3, I:I, FILE)
                count = 0
                salarysum = 0
                salaryavg = 0
                selobj2 = new Select(new From(3,rec2),
&                                    (Where)(rec2.department == rec1.department))
                foreach rec2 in selobj2
                  begin
                      count += 1
                      salarysum += rec2.salary
                  end
                salaryavg = salarysum/count
                writes(1,rec1.department + "    " + %string(count,"XX") +
&                      "    " + %string(salaryavg,"$$$,$$$,$$$") +
&                      "    " + %string(salarysum,"$$,$$$,$$$"))
                close 3
            end

After running this program, you should get output like this:

GroupBy Department with aggregate values:
DEPARTMENT         COUNT     AVG(SALARY)  SUM(SALARY)
Development        04        $57,525      $230,100
HR                 03        $36,700      $110,100
QA                 03        $65,056      $195,170
Sales              03        $75,043      $225,130
Shipping           02        $70,050      $140,100

This code uses GroupBy to return individual records for each department. Within the outer FOREACH loop, another Select object retrieves all records that match each group and keeps a running total of the number of records and the salary for all records in the department. After it’s done reading the records, the code calculates the average salary for the department and displays the data to the screen.

A downside to this method is that there’s no built-in way to sort groups based on an aggregate value. The order of groups is based on the field used in the GroupBy, and any ordering based on aggregate values would need to be done after the initial Select is finished and all aggregate values are calculated.

Limitations

At the time of this writing, the current syntax for GroupBy (introduced in 12.2.1.1003) can only be used when targeting the latest runtime. If you’re building a Synergy program from the command line, this means you must include an option like “-qrntcompat=120301” in the compiler and linker commands. If you’re building from Visual Studio, you can select a Target Synergy runtime of “Latest” in the Build properties for the project. (The original syntax introduced in 12.1 that uses Ascending() and Descending() methods, similar to the OrderBy class, can be used when targeting the default 120101 runtime level, but that syntax is no longer recommended.)

GroupBy requires compiler and runtime features that aren’t present in earlier versions of Synergy. You won’t be able to compile code that uses GroupBy if you target an earlier Synergy version. (In other words, if you use the -qrntcompat compiler and linker option, you can’t specify a version number prior to 12010100.) Likewise, if you’re accessing data over xfServer, the server must be running at least version 12 for GroupBy to work. By default, an xfServer client of version 12.2.1.1003 or later can only connect to a server of version 12.2.1.1003 or later, although you can use the SRV_COMPAT environment variable to override this and connect to a server with an earlier version.

At present, GroupBy is not supported with a JoinSelect. Queries using GroupBy must operate on a single file only. Likewise, GroupBy and OrderBy are not supported on the same query, although results returned by GroupBy will be sorted in ascending order.

GroupBy uses a static cursor and so operates on a snapshot of the data file at the time when the enumerator is created. As a result, it’s possible for the underlying data to change between the time when the Select statement starts working and the time when it retrieves any given record. This shouldn’t change the order of any groups, but it could result in new groups being skipped if records belonging to that group are only added after the Select has started working.

Because each result returned by a GroupBy query represents a group of records instead of a single record in a file, several Select features that require individual records won’t work. For instance, the RestrictedAlphaEnumerator methods CurrentRFA and DeleteCurrent are incompatible with GroupBy. A GroupBy result will have no change tracking information, so the GetCTInfo property is also incompatible. Record locking is incompatible with a GroupBy query, and a GroupBy query won’t execute if locks are enabled in the From object, so the IsLocked property is incompatible as well. The SparseUpdate method is also incompatible, because it can’t be used on a query with a Sparse object, and Sparse objects are required when using GroupBy.

In some circumstances (e.g., if it’s sorting by a single key that’s already in the correct order in the ISAM file), OrderBy can optimize a query by bypassing the process of sorting records and returning them in the correct order directly from the file. GroupBy doesn’t support this but will always sort the result set regardless of what fields are specified. This may result in slower performance than would otherwise be possible for some queries.

As mentioned above, aggregate functions are not currently supported with GroupBy.

Future Plans

While future plans have not been finalized, we’re considering addressing many of the above limitations in future releases of Synergy/DE. One of the biggest improvements we’d like to make in the near future is to support GroupBy with JoinSelect, similar to the support we recently added to use OrderBy with JoinSelect. We’d also like to support OrderBy and GroupBy on the same queries.

In the longer term, we’d like to support aggregate functions with GroupBy (Sum, Count, Average, Minimum, and Maximum), bringing feature parity closer to SQL. Ideally, this feature would allow arbitrary mapping of grouped and aggregated fields from a source record into a destination record, with arbitrary sorting of the results. This mapping could also be useful in other contexts besides GroupBy.

Another potential change is to optimize GroupBy, perhaps by returning records directly from memory after sorting them, instead of reading records from the ISAM file when retrieving them. We might also have GroupBy imply Sparse, so results aren’t cluttered by meaningless records and performance over xfServer is improved, without requiring an explicit definition of a Sparse object.

These changes will likely require new versions of the compiler, runtime, and xfServer, so backwards compatibility is one limitation we wouldn’t be addressing.

Conclusion

I hope this article has given you a good introduction to the GroupBy class. You should have some idea of how the class can be used, both theoretically and with practical code examples. You should also have a basic understand of the differences between the Synergy GroupBy class and SQL, as well as the current limitations of the class, but also some insight into how the class may develop in the future.

Announcing Synergy/DE 12.2.1.1003 Feature Release for OpenVMS

YAML Your Way to Rapid Deployment

Leave a ReplyCancel reply

SOLUTIONS

PRODUCTS

SUPPORT

LEARNING

COMPANY