Importing and managing a large data set in fizz
Abstract
This article looks at how fizz (version 0.3 and up) deals with importing a large set of data for inferences, and explores possible optimizations for faster loading and statement retrieval times. In it, we will be using the Escherichia coli (abbreviated as E. coli) genome retrieved from the ecogene.org website. We will also build the procedural knowledge needed to find out how many genes contain the famous GATTACA sequence in the E. coli genome. The completed application can be found in the etc/articles/e.coli folder of the fizz distribution. (Thanks to Robert Wasmann (@retrospasm) for providing feedback and reviewing this document.)
Prerequisite
A basic understanding of the concepts behind fizz is expected from the reader of this article. It is suggested to first read the introductory article Building a simple stock prices monitor with fizz (available on the web site), or at least sections two to four of the user manual, for an overview of the language and runtime usage.
Importing the raw DNA sequences
The entirety of the E. coli genome can be downloaded from http://www.ecogene.org/ in the FASTA text format, where the whole genomic DNA sequence is split over a large number of lines (of a known maximum length). We can use the import.txt command to instruct fizz to process that file and generate a statement for each line it contains.
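To make the shape of the generated statements concrete before we run the command, here is a minimal Python sketch of the same idea, one statement per line of the file. This is not fizz code, and the e.coli.fasta filename and the frag label are assumptions for illustration only:

    # Sketch: emit one (label, line_id, text) tuple per sequence line of a
    # FASTA file, mirroring "one statement per line". Names are illustrative.
    def read_fragments(path, limit=None):
        fragments, frag_id = [], 0
        with open(path) as f:
            for line in f:
                if line.startswith(">"):   # skip the FASTA header line
                    continue
                fragments.append(("frag", frag_id, line.strip()))
                frag_id += 1
                if limit is not None and frag_id >= limit:
                    break
        return fragments

    # First 10 lines of sequence, as in the article's test run:
    for stmt in read_fragments("e.coli.fasta", limit=10):
        print(stmt)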
To get started, let’s try the command by extracting the first 10 lines from the file. We will first use the spy command to get fizz to show us the generated statements:
Since the import.txt primitive doesn’t assert the statements it generates (it only declares them), we need to create an elemental object which will do that for us. For that, we create a fizz file called import.fizz containing a single procedural knowledge definition, which we’ll call import.frag. In it, we set up a single prototype which will be triggered by any line.f statement; for now, it will just output the line identifier to the console:
We will now restart fizz and load the file we just created. Next, we use the import.txt command as we did just above:
Now, since the statements are only being declared (broadcast on the substrate), and we would like them to be stored in memory, we need to modify the procedural knowledge to assert them using the assert primitive:
Re-running our code just as above won’t yield any visible difference; however, if we run the console command /stats we can see that there are now ten statements in the substrate:
We can also call the list console command to check that a new elemental has been created by fizz to collect all the statements we asserted:
In order to avoid having to re-import the whole genome each time, we’re going to save the elemental to disk with the save console command. We will make sure to indicate that we only want to save the frag elemental, since that’s the end result of the import:
The contents of frag.fizz we just created shouldn’t be much of a surprise:
Now, we could import the entirety of the E. coli genome this way; however, the file is rather large (61890 lines), and the runtime cost of asserting each of the 61889 statements is prohibitively high. Instead, we are going to replace the assert primitive with bundle, which allows statements to be bundled into a single factual knowledge that will be injected on the substrate, creating a new elemental to handle it.
Let’s modify import.fizz to use that primitive, and instruct it to split all the statements into bundles of 1024 statements:
Note that at the end of the prototype we have added a call to the primitive hush. This makes sure that the completion of the inference will not result in thousands of import.frag statements being published, which is unnecessary and would cost us a little performance. If we now restart fizz and import the first 2048 lines, we get 2048 statements split over 2 elementals:
If we were to import the whole file, 61 elementals would be created on the substrate to handle the statements, which is a little excessive; so we are going to change the bundle size we pass to the bundle primitive to 3072, which will give us 21 elementals:
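As a sanity check, the elemental counts follow directly from the bundle size; a quick Python computation over the 61889 statements:

    import math

    statements = 61889
    print(math.ceil(statements / 1024))  # 61 elementals at a bundle size of 1024
    print(math.ceil(statements / 3072))  # 21 elementals at a bundle size of 3072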
Finally, we will save the statements into a fizz file so that we can later reload them:
Since it’s a rather large file (5682972 bytes) for fizz to parse, don’t expect the loading time to be stellar:
As we are not actually transforming the statements we read from the text file, we can use a special mode of the import.txt command to bundle the statements over multiple elemental objects, like we did above, with the advantage of much better runtime performance since we won’t be executing any inference for each statement.
This is done by adding a list of flags (symbols) as the last term of the call. Here we will ask the import.txt command to bundle the statements by 3072, and to spawn a new elemental for each bundle:
Importing the gene descriptions
The second data set we are going to import into fizz is a collection of all the identified genes of the E. coli genome. The set comes as a CSV-formatted document (with tabulation as the separator), so we will be using the import.csv command. Unlike with the DNA data, we are going to have to perform some transformation on each of the statements that will be extracted, which means we won’t be able to make the import as fast as for the sequences. Fortunately, there are only 4504 lines to be processed in that file.
Each of the lines from the CSV file contains the following 13 fields, some of which we will be ignoring:
EG          | EcoGene Accession Number
ECK         | K-12 Gene Accession Number
Gene        | Primary Gene Name
Syn         | Alternate Gene Symbols
Type        | Genotype
Len         | Sequence length
Orientation | Orientation (Clockwise, Counterclockwise)
LeftEnd     | Genomic Address, left end
RightEnd    | Genomic Address, right end
Protein     | Protein description
Function    | Known function
Description | Description
Comments    | Comments
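To make the field layout concrete, here is a rough Python equivalent of what we are asking import.csv to extract from each line; the genes.tsv filename is an assumption for illustration:

    import csv
    from itertools import islice

    FIELDS = ["EG", "ECK", "Gene", "Syn", "Type", "Len", "Orientation",
              "LeftEnd", "RightEnd", "Protein", "Function", "Description",
              "Comments"]

    # Read the first 10 tab-separated records and pair each value with its field.
    with open("genes.tsv", newline="") as f:
        for row in islice(csv.reader(f, delimiter="\t"), 10):
            record = dict(zip(FIELDS, row))
            print(record["Gene"], record["Orientation"], record["LeftEnd"])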
Let’s start by adding a new procedural knowledge definition to the import.fizz file we have already been using. Its single prototype, triggered by any statement published by the import.csv command, will print the gene’s identifier to the console:
We can then go ahead and test that by importing the first 10 lines from the file:
Because the import.csv command doesn’t automatically convert strings into symbols, we are going to have to do so for all the terms that we wish to be handled as symbols. For that we will use the primitive str.tosym to convert the following terms: EG, ECK, Type, and Orientation:
Ideally, we would have liked to also convert the gene identifier to a symbol (e.g. hisM). Unfortunately, some of them contain unsuitable characters (e.g. rhsE'), so we will leave them as strings. Next, we will use the primitive str.tokenize to transform the fourth term (Alternate Gene Symbols) from a string into a list of strings, as that field uses a comma to separate the symbols:
If we run the import again, we can make two observations which will improve the representation of the data: first, when the string is "None", we should be using an empty list; second, we should trim each of the strings, since we can see instances where extra whitespace shows up (e.g. " fruF"). To that end, we’re going to add two procedural knowledge definitions to our import.fizz file to perform these transformations. The first one, which we will call clean.list, handles the trimming of any strings in a list. It works recursively (like most things in fizz), and uses the primitive str.trim for the actual trimming of the strings:
As a reminder, the caret that you see on the first two prototypes indicates that if the entry point does unify, the solver shouldn’t try to use any of the following prototypes. Now, the second bit of procedural knowledge we are going to add is the one we will be directly calling in import.gene. It simply either matches ["None"] to an empty list, or calls clean.list:
We can now add the transformation of the fourth term to import.gene:
If we now perform the test import again, then we get a much better result:
Now, we will perform the same transformation for the 10th and 11th terms; but before that, we’re going to make a small addition to import.gene.clean: the string can also have a value of "Null", which we will handle like we did for "None":
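Summed up in Python, the whole clean-up we have just described (tokenize on commas, trim whitespace, and map "None"/"Null" to an empty list) amounts to something like this sketch:

    def clean_field(value):
        # "None" and "Null" both mean: no values for this field.
        if value in ("None", "Null"):
            return []
        # Split on commas and trim stray whitespace (e.g. " fruF" -> "fruF").
        return [token.strip() for token in value.split(",")]

    print(clean_field("None"))        # []
    print(clean_field("fruF, fruK"))  # ['fruF', 'fruK']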
Lastly, we just need to assert a statement for each gene we are importing. To speed things up we’re going to use the bundle primitive like we did earlier with a bundle size of 2048:
We are now ready to import the whole content of the file. Note that depending on the runtime settings and your system’s performance, the number of elementals that will be spawned may differ, and they may not all contain exactly 2048 statements. We will then save the statements into a fizz file:
Optimizing loading time
Now that we have transformed both raw data files into the form of factual knowledge representation that fizz can easily manipulate, we are going to look at how to speed up the loading time. Let’s first assess it by loading both files when starting fizz:
To speed up the loading, we are going to take advantage of a native feature of fizz: the runtime will try to load each specified file concurrently. This means we need to split any large file into multiple files which can then be loaded concurrently (fizz is set up to use up to half of the cores it is enabled on to parallelize the loading).
We are going to start with the frags.fizz file, since it contains over 60000 statement definitions spread over 21 factual knowledges. Using the primitive fzz.lst (which can only be called as an offload), we are going to obtain the GUIDs of all these elementals and group them, so that we can then save each group into a separate file. To do that, we first need to add to import.fizz (which we have been working on for a while now) some new procedural knowledge definitions that will allow us to break a large list into small sub-lists.
For that, we first declare a procedural knowledge, which we will call lst.split, that splits a list into two based on an arbitrary number of elements to be included in the first list:
We will then use it in another procedural knowledge (which we call lst.break) as follows:
We’ll quickly check that they both work as expected:
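For reference, here is the same splitting logic sketched in Python; the recursion mirrors the fizz definitions, and the function names are kept only for readability:

    def lst_split(items, n):
        # Split a list in two: the first n elements, and the rest.
        return items[:n], items[n:]

    def lst_break(items, n):
        # Break a list into sub-lists of at most n elements, recursively.
        if len(items) <= n:
            return [items]
        head, tail = lst_split(items, n)
        return [head] + lst_break(tail, n)

    print(lst_break([1, 2, 3, 4, 5, 6, 7], 3))  # [[1, 2, 3], [4, 5, 6], [7]]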
By combining lst.break and fzz.lst, we are able to split all the elementals (using their GUIDs) into sub-lists:
From there, we will take every sub-list, make up an appropriate filename for it, and use the console.exec primitive to execute the save command which, when given a list of GUIDs, will save the identified elementals into the same file:
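The bookkeeping performed for each sub-list amounts to something like the following Python sketch; the frag-N.fizz naming scheme, the placeholder GUIDs, and the exact form of the save command are all illustrative assumptions:

    guid_groups = [["guid-01", "guid-02"], ["guid-03", "guid-04"]]  # from lst_break

    # Build one save command per group of GUIDs; each group lands in its own file.
    for i, group in enumerate(guid_groups):
        command = "save frag-{}.fizz {}".format(i, " ".join(group))
        print(command)  # in fizz, this is where console.exec would run the command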
We can now restart fizz and this time use the seven files containing the DNA fragments instead of the single file:
We now have a much better loading time.
Optimizing retrieval time
Once all the DNA fragments are loaded, let’s see what sort of performance we can get when retrieving a particular fragment based on its identifier:
As we have spread all the DNA fragments over 21 elementals, when we query the runtime for a specific one, the query is sent to all of them concurrently. While this is faster than if we had all the sequences in a single elemental, there’s one simple thing we can do to improve the performance: indexing the statements in each elemental based on their first term. This will allow for faster retrieval of any statement when one of the indexed terms is bound to a value in a query.
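Conceptually, such an index is just a hash map from the value of the indexed term to the statements carrying it, turning a linear scan into a single lookup; a minimal Python illustration:

    from collections import defaultdict

    # Each statement is (fragment_id, sequence); the first term is the id.
    statements = [(0, "AGCTTTTCAT"), (1, "TCTGACTGCA")]

    index = defaultdict(list)
    for stmt in statements:
        index[stmt[0]].append(stmt)  # key on the first term

    print(index[1])  # direct lookup instead of scanning every statement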
To do that, we are going to instruct each of the elementals to set up an index, using the poke console command:
This gives a value of 0 (the position of the term to be used as the indexing key in the statements) to the index property of all the elemental objects labeled frag. We can now check whether this has improved the query performance:
That’s much better. Note that once an index has been created, the elemental will maintain it when statements are added or removed, so you do not have to
poke at it after changes. Multiple indexes are also supported (using a list of indexes).
In order to avoid having to poke each time we load the DNA sequences, we are going to once again save all the frag elementals to file. We will use a different filename so as to keep the non-indexed version:
Alternatively, we could have edited each of the fizz files with a text editor and manually added the index property to each of the knowledge definitions, such as this:
We will complete this section by indexing the genes data, using the gene’s name (the third term in each statement) as the indexing term:
Finally, we also saved the newly modified elementals into a different fizz file.
Finding the famous GATTACA in the genes
To conclude this article, let’s look at something a little more fun: finding all the genes whose DNA contains at least one occurrence of the famous GATTACA string (from the 1997 sci-fi movie of the same name). To that end, we are going to have to write some procedural knowledge which, when given a gene’s name, will retrieve the complete DNA sequence and check if it contains a given substring.
To start, create a new fizz file called base.fizz. We will first write a way, given a gene’s name, to retrieve the offset and length (in base pairs) of its DNA sequence, as well as the orientation of the sequence (as we will see later, it matters!):
The first thing we do is to query (line 3) the gene factual knowledge for any statement matching the gene’s name. Since we only care about a few of the terms, we use the wildcard variable for most of them. Once we have the start and end base pairs, we compute the length of the sequence (lines 4 and 5) by subtracting the start offset from the end offset and then adding 1. The result is unified with the variable length, which we will return. We end (line 6) by subtracting 1 from the start offset, as the base-pair offset starts at 0 for us instead of 1.
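The arithmetic is simple enough to spell out in Python, with the left and right genomic addresses as inputs (1-based, as in the data set):

    def locate(left_end, right_end):
        # Length in base pairs: inclusive span between the two addresses.
        length = right_end - left_end + 1
        # Our fragment offsets start at 0, the data set's addresses at 1.
        offset = left_end - 1
        return offset, length

    print(locate(100, 175))  # (99, 76)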
Let’s give this a try:
Next, we are going to assemble the complete gene sequence from an offset and a length, and this is going to be a little trickier. First, we will need to find out which of the 60000+ fragments contains the start of the gene (based on the starting offset we retrieved earlier). The following procedural knowledge implements this:
It relies on the fact that each of the fragments (except the last one) contains 75 characters (base pairs) to compute the ID of the first fragment (using div.int), as well as the actual offset within that fragment (using the mod primitive). If we combine this new procedural knowledge with frag, we can retrieve the very first fragment, from which we will still have to extract the relevant part:
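In Python terms, the same computation looks like this (75-character fragments, numbered from 0):

    FRAG_LEN = 75  # every fragment but the last holds 75 base pairs

    def offset_to_id(offset):
        frag_id = offset // FRAG_LEN  # which fragment holds this base pair
        in_frag = offset % FRAG_LEN   # position of the base pair inside it
        return frag_id, in_frag

    print(offset_to_id(99))  # (1, 24): the 100th base pair sits in fragment 1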
Once we have the starting sequence, we are going to write the procedural knowledge that will assemble a complete sequence given the ID of the first fragment, the offset in that fragment and the total length of the sequence:
It works by recursively constructing a list that contains all the fragments that compose a gene’s sequence. To simplify further processing, we are going to transform that list of strings into a single string by concatenating all of its elements. We will also combine that operation with the call to frag.offset.to.id:
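Put together in Python, assembling a sequence from an offset and a length looks roughly like the following sketch; fragments stands in for the frag statements, as a dict from fragment ID to its 75-character string:

    FRAG_LEN = 75

    def assemble(fragments, offset, length):
        # Gather every 75-character fragment spanned by [offset, offset + length).
        first = offset // FRAG_LEN
        last = (offset + length - 1) // FRAG_LEN
        joined = "".join(fragments[i] for i in range(first, last + 1))
        # Trim the excess on both ends of the concatenated string.
        start = offset % FRAG_LEN
        return joined[start:start + length]

    fragments = {0: "A" * 75, 1: "C" * 75}
    print(assemble(fragments, 70, 10))  # 'AAAAACCCCC'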
Let’s try it with the offset and length we got earlier for the feaR gene:
The last bit of processing we need to do now is to handle the gene’s orientation. When it is Counterclockwise, we need to reverse the sequence and take its complement; taking the complement of a DNA sequence is done by swapping A for T, C for G, G for C, and T for A. When the orientation is Clockwise, no further processing of the reassembled sequence is necessary. We will create a new procedural knowledge to handle this:
It makes use of the primitive str.flip to invert the sequence, and of str.swap to swap the characters.
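The equivalent operation in Python uses a translation table for the base swaps and slicing to reverse the string; this mirrors the roles of str.flip and str.swap in the fizz version:

    COMPLEMENT = str.maketrans("ACGT", "TGCA")

    def orient(sequence, orientation):
        # Clockwise sequences are used as-is; counterclockwise ones are
        # reversed and complemented (A<->T, C<->G).
        if orientation == "Counterclockwise":
            return sequence[::-1].translate(COMPLEMENT)
        return sequence

    print(orient("GATTACA", "Counterclockwise"))  # 'TGTAATC'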
We are now ready to put the procedural knowledge gene.find together by combining all the procedural knowledge we have just created. Once we have finished processing the gene’s whole sequence, we will use the primitive str.find to test if it contains the fragment we are looking for:
We can then look for GATTACA in the entire E. coli genome:
As each query in fizz has a set time-to-live, if the value specified in the runtime configuration you are using is too low for the query to fully complete, the number of matching genes you find will be lower than in the above example. 303 genes appears to be the correct answer.
Using a binary store for the raw DNA sequences
The addition of the MRKCSBFStore elemental in fizz 0.4 allows the DNA sequences to be stored and readily available without occupying the host system’s memory, and with no loading time, since we won’t be saving nor loading the fragments (as statements) from a fizz file.
In this section, we are going to look at using this new feature. First, we are going to create a new fizz file which we will call fragz.fizz. In it, we will set up the elemental as follows:
The index property specifies that we want the statements stored in the binary store to be indexed on the first term of each statement, which is the identifier of the DNA fragment. This will allow for a much faster retrieval of the fragments, since this is the way we will be fetching them. Please refer to page 72 of fizz’s user manual for more details on the elemental’s properties.
We are now ready to import the DNA sequences using the same import.fizz we used before:
Because we have set the frag elemental in verbose mode, we are able to observe it ingesting the asserted statements and reporting the elapsed time for ingesting a single statement (here in the 400-500 microsecond range). Let’s now check that we have all the statements we were expecting:
61889 is the expected number of statements. We will now query the elemental just to check:
If you compare the elapsed time for the query (1 millisecond) when using the binary store to the one we got earlier when using in-memory statements, you will notice that it is the same.
To complete the import, we will request that the elemental optimize its content. Note that this won’t have any significant impact on the statement retrieval speed, since we are querying using the indexed term:
While this takes a bit of time, it is only suggested to be done occasionally, depending on how many changes are made to the store’s content. Finally, as the store is backed by an actual binary file, there is no need to save the elemental. To access the statements at a later time, we just need to import the fragz.fizz file:
If we were to load only the fizz files that we created earlier to store the 62k fragments, we would get a loading time of 1.797s, vs 0.001s for the binary store. Using the store does, however, come at a cost, as it uses a non-trivial amount of disk space: the fragz.sbfz file we have created is over 32MB, vs 5.6MB for all the fizz files that contain the fragments.