Sunday, 12 April 2015

A journey through ranking algorithms.

In Education, and particularly in School Reports, one of the things we have to do pretty often is to calculate a “ranking” of Students.  Sometimes it’s in an Assessment Task, sometimes it’s in Subject Exams, and I have no doubt that there are lots of other situations in and outside of School Admin Systems where this functionality is used.

This is a discussion of the various methodologies I went through over the years (and FileMaker versions of course!) in attempting to improve performance of this functionality. To clarify terminology I am going to use:

A “Rank” is an integer given to a student that provides a place in the student grouping, according to a “Mark”.  If two students have the same “Mark” then they are given the same ranking, and the next lower number is not used.  For instance if 4 students get marks of 99,98,98,97, the rankings would be 1,2,2,4.

A “Course” is a subject that is being undertaken by the students, e.g. Year 10 English.  A “course code” is a unique identifying code, e.g. “10ENG”, that can be used to relate the records to each other.

---

When I first started at Denbigh in 2002, we were using FileMaker 5 ( or 5.5?) at the time.  Our “ranking algorithm” was
1) sort the students by order of “mark”, then
2) post an integer in a “global rank” field, (starting with 1) and post the highest mark in “global mark” field, and then
3) loop through the records. If the mark was the same as in the global mark field, then assign the global rank value to the student’s “Rank” field, otherwise insert the "record number" into both the global rank field, and the student’s rank field, and the mark into the “Global Mark” field.

That algorithm has the advantage of being pretty linear in it’s processing time vs the population.  If you doubled the number of records, it took about twice as long.  The disadvantages were that it did take some time, a user had to push a button to make it happen, and if any Marks were edited the whole thing had to be run again.

Over the years that followed, I went through a couple of different algorithms, principally concerned with making it faster, and eventually making it into an unstored calculation instead.  Making it an unstored calculation then had repercussions in complexity, affecting process time severely in larger populations.  So I’m documenting my journey, arriving at the present method of an unstored calculation that evaluates reasonably fast.

The first improvement I tried required a self relationship.  On the records that carried the mark, I created a self relationship joined by the identifying code (e.g. “course code”) and the “mark” field.  This means that every record that had a mark, could see at least itself through the self join, and any other records that also had the same mark.  I also added another number field, “rankingSerialNumber”.  The algorithm was:
1) Find the records to be ranked.
2) Sort them by the “Mark” field, descending.
3) run a replace command into the “rankingSerialNumber” field, starting with 1, incrementing by 1.
4) run a replace command into the “Rank” field, = min (selfjoin::rankingSerialNumber)

This worked pretty well, instead of a loop through a found set of records, updating global fields and record in each loop, the Rank got derived by two replace commands.  Performance was improved considerably, especially over larger data sets.  It still required scripted operation though, where the user clicked the button.

The ideal would be an unstored calculation.  Along came FileMaker 7, with it’s introduction of “Recursive Custom Functions” providing a way.  If I retain the self-join relationship, this time only on the “Course Code”,  I can sort that relationship by “Mark" in descending order. Then I create a recursive custom function, that given a target relationship::field, and a “Mark Value”, the custom function can just recursively walk down the list of related records. The number of recursive calls needed to reach that “mark” value, is the “Rank”.  In implementation, I made two custom functions, for convenience.

First the recursive one:

GetRankRecursor ( sortedRelationship::searchField ; searchvalue ; n )

if ( 
   getNthRecord ( sortedRelationship::searchField ; n ) = searchvalue 
   ; 
   n 
   ; 
   GetRankRecursor ( searchField ; searchvalue ; n+1 ) 
)

Now the convenience one:

GetRank ( searchField ; searchvalue )

GetRankRecursor ( searchField ; searchvalue ; 1 )

Because the value I am searching for is guaranteed to exist, I don’t have to trap for missing values, and since the relationship is sorted, I’m guaranteed to be accessing them in order.

So to tie it all together the “Rank” field is now an unstored calculation =

GetRank ( CourseRank::Mark ; Mark )

Cool...

-------------

I thought so at the time anyway.  A customer trying to rank all their year 12 Math students brought home the fact that it’s computation complexity is approximately order n squared.  Twice the population did not take twice the time, it took quite a lot more.  I tested on a set of test data with 50 mark values.  Throw 500 marks at it, and you could go to lunch while it computed.

To resolve this, it’s necessary to look at two things :

1) the relationship had to sort it’s related records each and every time the recursive custom function called itself.  Fortunately FileMaker itself seems to be doing some internal cacheing, alleviating this quite a bit, but it’s still there.

2) If I have a test set of 50 marks, then the average number of recursive function calls per student mark was half that, i.e. an average number of recursive calls to rank 50 students could be as high as 25.  Increase the number of students to 500 and the AVERAGE number of recursive custom function calls *per student* could go as high as 250!

No wonder it took a lot longer when a larger data set walked up to it.

Fortunately, FileMaker 8 provided us with variables.

To improve this:
1) The first thing that occurred to me, is that it shouldn’t have to find related records and sort them for each and every student. If I find them once, and then store the ordered set of marks in a variable, then that repetitive part of the algorithm turns into a once-only operation.  If I store them as return delimited values in a variable, I can implement the second improvement.

2) The second improvement that occurred to me was that if I wanted to find my particular value in a set of return-delimited values in a variable, I didn’t necessarily have to recurse through it using a custom function.  I could use FileMaker’s native “Position()” function to tell me where my mark was in that variable’s data, then use PatternCount() to count the number of “returns” in the text up to that point, and I have the rank number.

The sneaky little gotcha with this idea is that the very first value must be a return character. This is because we can't just search for "2", if we did it would return at the position of "22" or "2something".  So we have to search for "<return>searchvalue<return>". You also want to have a trailing return character by itself too, for the same reason.

So now, the calculated ranking field becomes slightly more complicated.  It has to look for it’s value in a valueList, and create that valueList if it doesn’t exist yet.  The value list will have to be in a global variable so that it is visible no matter what script (if any) is running.  Then I find the position of the Mark in that ValueList, and count how many return characters exist "to the left" of that position.

Let([
  $$markValueList = if ( is empty ( $$markValueList ); “¶” & list ( CourseRank::Mark ) & “¶” ; $$markValueList );
  offset = Position ( $$markValueList ; Mark ; 1 ; 1 );
  rankValue = PatternCount ( left ( $$markValueList ; offset ); “¶” ) + 1 
  /* because there is nothing to the left of the highest mark */
];
  rankValue
)

So now I have a calculated field for a student’s rank, and the only significant performance hit is in the sorted relationship “CourseRank” that gives me the sorted list of values.  It only has to do that once though.

Now because this is an unstored calc, it will only be evaluated when it’s needed for display, computation, or print purposes.  The catch here is that some teachers like to see the student ranks as they are entering the marks.  So not only do I refrain from calculating a rank until a Mark value exists, but I must somehow make it recalculate whenever a “Mark” is inserted or edited.  The simplest method would be an auto enter calc on the "Mark" field that causes the $$markValueList to be cleared each time a mark is inserted.  As soon as the global variable is emptied, the very next Ranking calculation will cause it to be regenerated automatically with the new data in it.

One more possible improvement : instead of having the sorted relationship “CourseRank”, an unsorted relationship could be used to gather up the list of “Mark” values.  Not having to sort the records first would mean that it could gather the values much faster, particularly on larger data sets.  This is not of critical interest where the table is very narrow, but if it's a wide table, it becomes much more critical.  This is because FileMaker cannot retrieve the value of a field without first retrieving the whole record, so the wider the table, the more data it has to retrieve from disk before sorting the Mark values.  Of course I still need to sort those values in descending order, but I can use custom functions for that.  One possibility is https://www.briandunning.com/cf/1181 but in this, Google is your friend.  A caveat with loading an unsorted set of values into a global and then sorting them, is that they will be text, not numbers, so you need to watch for thinks like “3 being higher than 22”.

I am somewhat ambivalent about that last improvement : FileMaker sorting records by embedding the sort in the relationship might be just as quick as using custom functions to sort text, depending on the schema.  Worth testing it.  Also, it can get complex, so remember that sometime in the future, someone who has never seen your code before may have to try and figure out what you were doing.  So if you go down that path, make your code as easy to comprehend as possible.

One last thing is the global variable name.  You can use a “hardwired” global variable, e.g. $$courseMarkList, but if so, you must take care to clear it after use, so that if you go on to rank another set of data, your first set of values won’t be there anymore.  I’m not a fan of this approach, it would be very easy to accidentally use the wrong data set.  My preference is to use individual global variables, one for each course.  So if I am ranking the “10ENG” course marks, I would be using a global variable that is something like “$$courseMarkList10ENG”. This is really easy:

Custom Function “Create Global Variable ( variableName ; variableValue)”

evaluate ( “let($$” & variableName & “=\”“ & variableValue & \”;true)” )

Custom Function “Get Variable Value ( variableName )”
=
evaluate (“\”” & variableName & “\””)

So now we have a ranking algorithm that gives live ranking calculations across arbitrary data sets pretty quickly and easily.

No comments:

Post a Comment