Wednesday 9 July 2014

Doing a mass export of all data, using FileMaker’s native ExecuteSQL() function and a virtual table.

Recently at work we had a customer request for a facility to do a mass export of all their data.  The reason such a facility did not already exist requires a little background.

“Denbigh Admin” is our primary product.  While we do other work as well, this is our bread and butter.  Denbigh Admin first started out around 1998 as “.fp3” files.  Back then, a FileMaker database file had exactly one table, so in order to add modular functionality, new files were created and added as necessary.  Eventually we exceeded 100 database files in our file set, which got entertaining when you consider that FileMaker Pro clients could only open a maximum of 50 files at a time back then!  When FileMaker Pro 7 came out, the “.fp7” file format allowed multiple tables in the one file, and we conducted some table consolidation around modular functionality (and user privilege areas), reducing to an eventual set of 47 files.  Over time that has slowly crept up again, and we are now up to 83 files and around 250 tables or so, depending on which version of Denbigh Admin you have and what customisations were specified.  I should also mention that every copy of Denbigh Admin is customised for each and every customer.  No two are identical.

Anyway, back to the customer request.  Because of the background I mentioned above, there actually isn’t a single facility in Denbigh Admin to export or import all data at once.  When we have to upgrade a customer’s files, it’s always specific modules only, and since no two systems are exactly the same, we never worried about a generic export & import setup.

So one of our developers started doing it the obvious way: going to each file and building export routines for the tables in those files.  A little way into the process the job came up for discussion, and I suggested another way, using FileMaker’s native ExecuteSQL() function to get the data and a “Virtual Table” to export it.  The next question was where in the file set to do this: use an existing file, or make a new one?  Some time back we implemented FMDataGuard auditing, but when it was not compatible with FileMaker Pro 13 (yes, I know it is now) we removed it in favour of an entirely native auditing solution.  That is the topic of another post…  Anyway, here we were with an “AuditLog” file containing very little code and only one small table: a very nice little utility file that we could add functionality to without adding to the file population.

So I added “external data sources” to all the other files in Denbigh Admin, and then added table occurrences to the relationship graph.  The trick at this point is to name the table occurrences according to some sort of pattern.  We had only the original AuditLog table in the relationship graph to start with, so I decided to name the new table occurrences using “<filename>.<tablename>”.  This took care of a couple of potential problems: first, we had a couple of different Finance-related files with some identically named tables in them, and this method prevented any ambiguity; second, it made it easy to identify which table occurrences were there for export purposes, because they contain a “.” in their name.
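To illustrate the pattern (these names are made up for this post, not our real file and table names), the export table occurrences might end up looking like:

Students.CurrentStudents
Students.AttendanceArchive
Finance.Invoices
FinanceGL.Invoices

so the two Finance-related files can each contain an “Invoices” table without clashing.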

The next question: how to dynamically detect all such table occurrences?  Andrew Duncan wrote an article about FileMaker’s own “schema tables”, which can be used to determine the relationship graph contents.  I used it to get the list of table occurrences that have a “.” in their name, like this:

Custom Function
Get_export_table_list =
Let(
sqlQuery = "SELECT TableName FROM FileMaker_Tables
WHERE FileMaker_Tables.TableName LIKE '%.%'"
;
ExecuteSQL ( sqlQuery ; Char(9) ; Char(13))
)

So now I have a list of all the table occurrences that contain a period “.” in their name.  The next thing is to get a list of the stored data fields for a table occurrence.  I made a custom function for this too, so that I can just pass in the table occurrence name and get out the field list, like so:

Custom Function
Stored_FieldNames_For_Table ( tableName ) =
Let([
tabchar = Char ( 9 );
returnchar = Char ( 13 ) ;
sqlQuery = "SELECT a.FieldName 
FROM FileMaker_Fields a
WHERE a.TableName = '" & tableName & "' AND a.FieldClass = 'Normal' AND a.FieldType NOT LIKE 'global%' AND a.FieldType != 'binary' "
];
ExecuteSQL ( sqlQuery ; tabchar ; returnchar )
)

This returns the list of fields classed “Normal”, so it doesn’t include any calculated fields; it also excludes container fields (hence the “!= ‘binary’”) and global fields, since those, by definition, do not contain stored record data.
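As an illustration with the made-up names from earlier, calling

Stored_FieldNames_For_Table ( "Students.CurrentStudents" )

might return a field list along the lines of:

StudentID
FirstName
Surname
DateOfBirth

(hypothetical field names, one per line, ready to feed into the next step).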

So now we have the tools to discover the tables we want to export, and the fields in those tables that we want to export.  If we want to add another table to the export routine in the future, we only need to add it to the relationship graph, and name it according to the schema above.  No other configuration or scripting necessary.

Halfway there!  Or so I thought….

The initial idea here was to export the data using a virtual table, so I ran through the tables in the graph, and the widest one was the “Current Student” table with 187 stored fields.  I decided to make the virtual table with 200 fields to leave a little leeway.

The virtual table technique is described in lots of places around the Internet, but what it comes down to is making a table where every field is an “unstored calculation” on a variable, and to get data in it, you push that data into the variable. Most techniques use a global variable, and I initially did too, so that the data stayed visible to all scripts, even if I exited the script at any time. Since variables can have any number of “repetitions” we can programmatically use one as a theoretically infinite array.  In practice the limits are set by the machine’s resources, but you’d have to be a very sadistic programmer to overload the machine by populating variables!

Anyway, there are a couple of different ways to go about this: using a global variable “repetition” as the “column”, or using it as the “row”.  Initially I figured the column would be the way to go, because I could load up $$VirtualColumn[n] with a dedicated SQL call for each field.  Time for another custom function:

Custom Function
SQLQuery_For_field ( tableName ; fieldName ) =
Let([
tabchar = Char ( 9 );
returnchar = Char ( 13 ) ;
sqlQuery = "SELECT a.\"" & fieldName & "\" FROM \"" & tableName & "\" a"
];
ExecuteSQL ( sqlQuery ; tabchar ; returnchar )
)
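To illustrate with the same hypothetical names, SQLQuery_For_field ( "Students.CurrentStudents" ; "FirstName" ) builds and runs:

SELECT a."FirstName" FROM "Students.CurrentStudents" a

and returns that single column of data as a return-delimited list.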

Putting it all together:

begin script
  /* $tableName holds the table occurrence being exported */
  set variable $fieldList to Stored_FieldNames_For_Table ( $tableName )
  set variable $fieldCount to ValueCount ( $fieldList )
  set variable $counter to 1
  loop
    exit loop if $counter > $fieldCount
    set variable $thisField to GetValue ( $fieldList ; $counter )
    set variable $$VirtualColumn[$counter] to SQLQuery_For_field ( $tableName ; $thisField )
    set variable $counter to $counter + 1
  end loop
end script

OK!  We’ve got the global variables $$VirtualColumn[n] populated.  Now we go to the virtual table, where the fields are unstored calculations like this:

Field001 = GetValue ( $$VirtualColumn[1] ; Get ( RecordNumber ) )
Field002 = GetValue ( $$VirtualColumn[2] ; Get ( RecordNumber ) )
Field003 = GetValue ( $$VirtualColumn[3] ; Get ( RecordNumber ) )
.
.
Field200 = GetValue ( $$VirtualColumn[200] ; Get ( RecordNumber ) )

Now how many records do we need in the virtual table?  Even easier: ValueCount ( $$VirtualColumn[1] ) and you’ve got it.  Go to the VirtualTable layout, make sure there are enough records, and export.
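A minimal sketch of that step might look like the following (the “VirtualTable” layout name is illustrative, and the real routine will differ in detail):

begin script
  set variable $neededRecords to ValueCount ( $$VirtualColumn[1] )
  go to layout "VirtualTable"
  show all records
  loop
    exit loop if Get ( FoundCount ) >= $neededRecords
    new record/request
  end loop
  /* leftover records from a previous, larger export could be deleted here */
end script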

I’ve skipped a step here: the customer’s request was to export the data as merge files, “.mer”.  That’s easy, since a merge file is just a CSV file whose first row is the field names.
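So for a made-up three-record table, the exported merge file would look something like this (entirely fictional sample values):

StudentID,FirstName,Surname,DateOfBirth
1001,Jane,Citizen,14/03/2001
1002,John,Smith,02/11/2000
1003,Mary,Jones,27/07/2002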

Anyway, I thought I was home free at this point.  SQL queries populate the $$VirtualColumn[n] variables in a matter of seconds; then to export the data, set a $fileName variable to the table name & “.mer”, add the path to the desktop, and export the whole 200 fields as CSV.  If the table being exported has fewer than 200 fields, the remaining fields are simply empty: no field name and no data.  No problem!
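The export step itself might be sketched like this (the path handling is an assumption on my part; depending on platform the path may need a file: prefix):

begin script
  set variable $fileName to $tableName & ".mer"
  set variable $filePath to Get ( DesktopPath ) & $fileName
  go to layout "VirtualTable"
  show all records
  export records [ no dialog ; "$filePath" ; comma-separated text ]
end script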

The more experienced FileMaker developers out there are going to know what’s coming next; they’ll have spotted the problem a couple of paragraphs back.

You see, when it goes to export the first row of data, it grabs the first value from $$VirtualColumn[n] for each field in that row.  When it exports the second row of data, it goes to the second value in $$VirtualColumn[n]; for the third row of data, the third value in $$VirtualColumn[n], and so on.  This is no problem if you only have a few thousand records at most.  But if you have a lot more, you’re in trouble, because by the time it gets to the 10,000th record it is parsing through 10,000 values in the variable to find the one for that field in that record.  The “Attendance Archive” table had 85,000+ records, and it took 5 hours to export.  The Student Outcomes table had 430,000+ records; I wasn’t even going to attempt that!

So we have to switch to a $$VirtualRow[n] structure instead of $$VirtualColumn[n].  In some ways this is pretty easy.  First, yet more custom functions:

Custom Function
Stored_FieldNames_For_Table_As_Field_List ( tableName ) =
Let([
t1 = Stored_FieldNames_For_Table ( tableName );
t2 = Substitute ( t1 ; "¶" ; "\",a.\"" );
t3 = "a.\"" & t2 & "\""
];
t3
)

Custom Function
SQLQuery_For_Table ( tableName ) =
"SELECT " & Stored_FieldNames_For_Table_As_Field_List ( tableName ) & "¶FROM \"" & tableName & "\" a"

Then we get the data and load it into a variable:

Set $$TableData to ExecuteSQL ( SQLQuery_For_Table ( $tableName ) ; Char ( 9 ) ; Char ( 13 ) )

Then we parse it out to the $$VirtualRow[n] variables:

begin script
  set variable $tabchar to Char ( 9 )
  set variable $returnchar to Char ( 13 )
  set variable $counter to 1
  loop
    set variable $recordCount to ValueCount ( $$TableData )
    exit loop if $recordCount = 0
    set variable $$VirtualRow[$counter] to Substitute ( GetValue ( $$TableData ; 1 ) ; $tabchar ; $returnchar )
    /* so each field in the source data becomes a separate value in the virtual row */
    set variable $counter to $counter + 1
    set variable $$TableData to RightValues ( $$TableData ; $recordCount - 1 )
    /* drop the row just processed, so that it’s always taking the first remaining row of data */
  end loop
end script

At this point, the virtual table has to have a change in its calculation for each field, to:

Field001 = GetValue ( $$VirtualRow[Get ( RecordNumber )] ; 1 )
Field002 = GetValue ( $$VirtualRow[Get ( RecordNumber )] ; 2 )
Field003 = GetValue ( $$VirtualRow[Get ( RecordNumber )] ; 3 )
Field004 = GetValue ( $$VirtualRow[Get ( RecordNumber )] ; 4 )
.
.
Field200 = GetValue ( $$VirtualRow[Get ( RecordNumber )] ; 200 )

Now the maximum number of values the calc engine needs to parse when exporting any one field is 200.  However, getting the data from $$TableData into $$VirtualRow[n] takes a really long time.  When attempting to do it on the Student Outcomes table, with 430,000+ records, I timed the first few thousand and it was going to take WAY too long to parse it all.

So, thinking about it some more, I realised that if I parsed through the whole pile of data in one big loop, each iteration would have to work through an average of ( ValueCount ( all the data ) / 2 ) values.  Moreover, the time it was taking to get to a value did not seem to be increasing linearly: if the 50th value took (t) time, the 100th value seemed to take MORE than (t x 2).  If that theory was correct, then breaking the data down into smaller chunks should improve things considerably.  So I tried this instead:

Set $$AllTableData to ExecuteSQL ( SQLQuery_For_Table ( $tableName ) ; Char ( 9 ) ; Char ( 13 ) )

begin script
  set variable $tabchar to Char ( 9 )
  set variable $returnchar to Char ( 13 )
  set variable $counter to 1
  loop
    exit loop if ValueCount ( $$AllTableData ) = 0
    set variable $TableData to LeftValues ( $$AllTableData ; 1000 )
    set variable $$AllTableData to If ( ValueCount ( $$AllTableData ) > 1000 ; RightValues ( $$AllTableData ; ValueCount ( $$AllTableData ) - 1000 ) ; NULL )
    loop
      set variable $recordCount to ValueCount ( $TableData )
      exit loop if $recordCount = 0
      set variable $thisOne to Substitute ( GetValue ( $TableData ; 1 ) ; $tabchar ; $returnchar )
      /* so each field in the source data becomes a separate value in the variable */
      set variable $TableData to RightValues ( $TableData ; $recordCount - 1 )
      /* so that it’s always taking the first remaining row of data */
      set variable $$VirtualRow[$counter] to $thisOne
      set variable $counter to $counter + 1
    end loop
  end loop
end script

So what is this doing?  It’s taking that big whopping pile of $$AllTableData, breaking it into 1,000 row chunks, and processing each chunk.
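To put rough (back-of-envelope, not measured) numbers on why this helps: each RightValues() call has to work through roughly all of the values still in the list, so one big loop over 430,000 rows averages about 215,000 values per iteration, on the order of 90 billion value operations in total.  With 1,000-row chunks, the outer loop does about 430 splits of the big list (roughly 90 million values) and the inner loops average about 500 values per iteration (roughly 215 million values across all the chunks), which is around 300 times less work, before even allowing for the non-linear behaviour mentioned above.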

This architecture parsed all the data of the 430,000+ records into the $$VirtualRow[n] global variables in well under an hour, and exported it all in the time it took me to go get another coffee.  Wish I’d thought of it sooner!

So what other catches are there?  Well, you may want to clear all the $$VirtualRow variables between each table export.  What’s worse is that you can’t do this with a recursive custom function, because the maximum number of times a recursive custom function can call itself is 10,000.  So you have to make a looping script: pass in a parameter of how many rows you had in the last export, and loop through setting $$VirtualRow[$counter] to NULL (there’s a sketch of this after the script below).  (I forgot to mention that I have a custom function called NULL that returns nothing at all; a surprisingly useful function, actually.)  There is another way to get around this, and that would be to write the parsing and exporting logic in one big script, and use a script variable $VirtualRow[n] instead of a global variable $$VirtualRow[n].  Then when you exit the script, the $VirtualRow[n] variables all get destroyed for you by the system.  The logic would look like this:

begin script
  set variable $tableList to Get_export_table_list
  set variable $tableCount to ValueCount ( $tableList )
  set variable $counter to 1
  loop
    exit loop if $counter > $tableCount
    set variable $thisTable to GetValue ( $tableList ; $counter )
    Export_data_from_table ( $thisTable )
    set variable $counter to $counter + 1
  end loop
end script
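Going back to the clearing option for a moment, a minimal sketch of that looping clear script might be (the script name is hypothetical, and the row count of the previous export is assumed to arrive as the script parameter):

begin script
  /* “Clear_Virtual_Rows” */
  set variable $rowCount to Get ( ScriptParameter )
  set variable $counter to 1
  loop
    exit loop if $counter > $rowCount
    set variable $$VirtualRow[$counter] to NULL
    set variable $counter to $counter + 1
  end loop
end script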

The routine “Export_data_from_table” would then do the actual work of determining the field names, loading and parsing the data, and exporting, all in one script using script variables $VirtualRow[n].
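A rough outline of that script, under the same assumptions as the sketches above (illustrative only; the real routine will differ in detail):

begin script
  /* “Export_data_from_table” ; the table occurrence name arrives as the script parameter */
  set variable $tableName to Get ( ScriptParameter )
  set variable $fieldList to Stored_FieldNames_For_Table ( $tableName )
  set variable $allTableData to ExecuteSQL ( SQLQuery_For_Table ( $tableName ) ; Char ( 9 ) ; Char ( 13 ) )
  /* parse $allTableData into $VirtualRow[n] in 1,000-row chunks, exactly as in the chunked script above, but with script variables so they are destroyed when the script exits */
  /* one way to get the merge header row is to load the field names from $fieldList as the first virtual row */
  /* make sure the VirtualTable layout has enough records, then export to $tableName & ".mer" as described earlier */
end script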

——

So what to take away from this?  A technique to handle mass exports of data using ExecuteSQL and a Virtual Table that is very generic in its operation, requires very little configuration change to suit almost any system, and automatically adjusts for any fields you add or remove from a table.

However, while FileMaker makes working with return-delimited values really easy, it is correspondingly easy to program in a very inefficient manner.  If you have to process a really large quantity of text values, break it up into smaller pieces first!

2 comments:

  1. Very nice!

    I may have missed something in scanning the whole article, so my next comments may not apply. I have done very similar processes in the past, and if you would consider using a free plugin, ScriptMaster by 360Works, you may be able to simplify this even more.

    By just using the first SQL call to grab the data from each table as the value of a global variable, and then exporting that global variable with a function from ScriptMaster, the job is done. No need for a virtual table or all of the other processing and looping.

    However if you can't or don't want to use a plugin then my suggestion is a moot point!

    1. Thanks for the pointer Karstyn, one of my colleagues has done a lot with 360Works plugins but I haven't, and was not aware of that capability. Come to think of it, he might not have thought of it either ;-)
