Importing and managing a large data set in fizz

Jean-Louis Villecroze
jlv@f1zz.org @CocoaGeek
September 1, 2018
Abstract

This article looks at how fizz (version 0.3 and up) deals with importing a large set of data for inferences, and explores possible optimizations for faster loading and statement retrieval time. In it, we will be using the Escherichia coli (abbreviated as E. coli) genome retrieved from the ecogene.org website. We will also build the procedural knowledge needed to find out how many genes contain the famous GATTACA sequence in the E. coli genome. The completed application can be found in the etc/articles/e.coli folder of the fizz distribution.

Thanks to Robert Wasmann (@retrospasm) for providing feedback and reviewing this document.

Prerequisite

A basic understanding of the concepts behind fizz is expected from the reader of this article. It is suggested to first read the introductory article Building a simple stock prices monitor with fizz (available on the web site), or at least sections two to four of the user manual, for an overview of the language and runtime usage.

Importing the raw DNA sequences

The entirety of the E. coli genome can be downloaded from http://www.ecogene.org/ in the FASTA text format, where the whole genomic DNA sequence is split over a large number of lines (of a known maximum length). We can use the import.txt command to instruct fizz to process that file and generate a statement for each line that it contains.

To get started, let’s try the command by extracting the first 10 lines from the file. We will first use the spy command to get fizz to show us the generated statements:

$ ./fizz.x64
fizz 0.3.0-X (20180519.2228) [x64|8|w|l]
?- /spy(append,line.f)
spy : observing line.f
?- /import.txt("./etc/data/U00096.3.txt",line.f,1,10)
import.txt : 10 lines read in 0.000s.
spy : S line.f(0, "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG") := 1.00 (700.000000)
spy : S line.f(1, "AACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATA") := 1.00 (700.000000)
spy : S line.f(2, "GCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATC") := 1.00 (700.000000)
spy : S line.f(3, "ACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG") := 1.00 (700.000000)
spy : S line.f(4, "CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAAT") := 1.00 (700.000000)
spy : S line.f(5, "GCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCT") := 1.00 (700.000000)
spy : S line.f(6, "GCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAAT") := 1.00 (700.000000)
spy : S line.f(7, "ATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCG") := 1.00 (700.000000)
spy : S line.f(8, "CAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAG") := 1.00 (700.000000)
spy : S line.f(9, "TGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTA") := 1.00 (700.000000)
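
As a rough mental model of the session above, the Python sketch below numbers the lines of a text file from a chosen starting line and stops after a given count. It is only an illustration and not fizz code: the function name import_lines is hypothetical, and reading the third and fourth arguments of import.txt as "first line to read" and "number of lines to read" is an assumption drawn from the log (the first line of the file, presumably the FASTA header, is skipped, and the generated statements are numbered from 0).

```python
import io
from itertools import islice

def import_lines(stream, first_line, count=None):
    """Yield (index, text) pairs, one per line, starting at first_line.

    Assumption: indices restart at 0, as in the spy output above.
    """
    stop = None if count is None else first_line + count
    for index, line in enumerate(islice(stream, first_line, stop)):
        yield index, line.rstrip("\n")

# A tiny stand-in for the FASTA file: one header line, then sequence lines.
sample = io.StringIO(">header\nAGCT\nTTGA\nCCGT\n")
statements = list(import_lines(sample, first_line=1, count=2))
print(statements)
```

Each pair plays the role of one line.f(index, text) statement; what fizz does with the statement afterwards (declaring it on the substrate) is not modelled here.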

Since the import.txt primitive doesn’t assert the statements it generates (it only declares them), we need to create an elemental object which will do that for us. For that, we create a fizz file called import.fizz and set up a single procedural knowledge definition which we’ll call import.frag. In it we will set up a single prototype which will be triggered by any line.f statement, and for now we will just output the line identifier to the console:

import.frag {

    () :- @line.f(:i,:s), console.puts(:i);

}

We will now restart fizz and load the file we just created. Next use the import.txt command as we did just above:

$ ./fizz.x64 ./etc/articles/e.coli/import.fizz
fizz 0.3.0-X (20180519.2228) [x64|8|w|l]
load : loading ./etc/articles/e.coli/import.fizz ...
load : loaded  ./etc/articles/e.coli/import.fizz in 0.001s
?- /import.txt("./etc/data/U00096.3.txt",line.f,1,10)
import.txt : 10 lines read in 0.000s.
1
2
3
4
5
6
7
8
9

Since the statements are only being declared (broadcast in the substrate) and we would like them to be stored in memory, we need to modify the procedural knowledge to assert them using the assert primitive:

import.frag {

    () :- @line.f(:i,:s), assert(frag(:i,:s)), console.puts(:i);

}

Re-running our code just as above won’t yield any visible difference; however, if we run the console command /stats we can see that there are now ten statements in the substrate:

$ ./fizz.x64 ./etc/articles/e.coli/import.fizz
fizz 0.3.0-X (20180519.2228) [x64|8|w|l]
load : loading ./etc/articles/e.coli/import.fizz ...
load : loaded  ./etc/articles/e.coli/import.fizz in 0.001s
?- /import.txt("./etc/data/U00096.3.txt",line.f,1,10)
import.txt : 10 lines read in 0.000s.
1
2
3
4
5
6
7
8
9
?- /stats
stats : e:7 k:3 s:10 p:1 u:15.33 t:1 q:0 r:0 z:29

We can also call the list console command to check that a new elemental has been created by fizz to collect all the statements we asserted:

?- /list
list : 0cd38dac-93cd-dd42-d998-af71e107397b MRKCBFSolver        import
list : f2850094-0398-0942-45ad-0134a4f86744 MRKCLettered        frag
list : 2 elementals listed in 0.000s

In order to avoid having to re-import the whole genome each time, we’re going to save the elemental to disk with the save console command. We will make sure to indicate that we only want to save the frag elemental, since that’s the end result of the import:

?- /save("./etc/articles/e.coli/frag.fizz",frag)
save : completed in 0.001s.

The contents of frag.fizz we just created shouldn’t be much of a surprise:

frag {

    (0, "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTG");
    (1, "AACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATA");
    (2, "GCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATC");
    (3, "ACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGGG");
    (4, "CTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAAGTTCGGCGGTACATCAGTGGCAAAT");
    (5, "GCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCCAGGCAGGGGCAGGTGGCCACCGTCCTCTCT");
    (6, "GCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTGAAAAAACCATTAGCGGCCAGGATGCTTTACCCAAT");
    (7, "ATCAGCGATGCCGAACGTATTTTTGCCGAACTTTTGACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCG");
    (8, "CAATTGAAAACTTTCGTCGATCAGGAATTTGCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAG");
    (9, "TGCCCGGATAGCATCAACGCTGCGCTGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTA");

}

Now, we could import the entirety of the E. coli genome this way; however, this file is rather large (61890 lines) and the runtime cost of asserting each of the 61889 statements one by one is prohibitively high. Instead, we are going to replace the assert primitive with bundle, as it allows statements to be bundled into a single procedural knowledge which will be injected on the substrate, thus creating a new elemental to handle it.

Let’s modify import.fizz to use that primitive and instruct it to split all the statements into a bundle of 1024 statements:

import.frag {

    () :- @line.f(:i,:s), bundle(frag(:i,:s),1,{},1024), hush;

}

Note that at the end of the prototype we have added a call to the primitive hush. This makes sure that the completion of the inference does not result in thousands of unnecessary import.frag statements being published, which gives us slightly better performance. If we now restart fizz and import the first 2048 lines, we will get 2048 statements split over 2 elementals:

$ ./fizz.x64 ./etc/articles/e.coli/import.fizz
fizz 0.3.0-X (20180519.2228) [x64|8|w|l]
load : loading ./etc/articles/e.coli/import.fizz ...
load : loaded  ./etc/articles/e.coli/import.fizz in 0.001s
?- /import.txt("./etc/data/U00096.3.txt",line.f,1,2048)
import.txt : 1000 lines parsed ...
import.txt : 2000 lines parsed ...
import.txt : 2048 lines read in 0.013s.
?- /list
list : 352b248c-3dd2-444d-2593-aaee3921226e MRKCBFSolver        import
list : 18525ad5-a1b1-8a41-cfac-c8c78c9ca790 MRKCLettered        frag
list : cb705a4d-3092-7942-658f-aeee4cba1bdc MRKCLettered        frag
list : 3 elementals listed in 0.000s
?- /stats
stats : e:8 k:4 s:2048 p:1 u:41.93 t:0 q:0 r:0 z:4096

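If we were to import the whole file with that bundle size, 61 elementals would be created on the substrate to handle the statements, which is a little excessive. We are therefore going to change the bundle size we provide to the bundle primitive to 3072, which brings the count down to 21 frag elementals (the list command below reports 22 elementals because it also counts the import elemental):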

$ ./fizz.x64 ./etc/articles/e.coli/import.fizz
fizz 0.3.0-X (20180519.2228) [x64|8|w|l]
load : loading ./etc/articles/e.coli/import.fizz ...
load : loaded  ./etc/articles/e.coli/import.fizz in 0.001s
?- /import.txt("./etc/data/U00096.3.txt",line.f,1)
import.txt : 1000 lines parsed ...
import.txt : 2000 lines parsed ...
...
import.txt : 61000 lines parsed ...
import.txt : 61889 lines read in 0.397s.
?- /stats
stats : e:20 k:16 s:43008 p:1 u:15.94 t:13 q:0 r:0 z:105989
?- /stats
stats : e:24 k:20 s:55296 p:1 u:18.42 t:15 q:0 r:0 z:119389
?- /stats
stats : e:27 k:23 s:61889 p:1 u:21.28 t:880 q:0 r:0 z:123778
?- /stats
stats : e:27 k:23 s:61889 p:1 u:22.83 t:0 q:0 r:0 z:123778
?- /list
list : b073f751-abd8-f14f-1299-b7b45d05a1c8 MRKCBFSolver        import
list : bf02b90a-30ce-2a41-e283-fbd96beccd7f MRKCLettered        frag
list : bb4759d0-69fa-b348-3abf-19dbb0a161bb MRKCLettered        frag
list : b0489927-a51d-924b-0d87-ed12179ca732 MRKCLettered        frag
list : e427c9e6-792c-5f4b-d081-1db31eaf72a8 MRKCLettered        frag
list : e0f3714c-0811-8c4a-0994-46c1fec1e470 MRKCLettered        frag
list : 23891829-5335-8b42-0e90-ba77cb5e6d7f MRKCLettered        frag
list : 6d9660c3-ce66-104c-8cbf-58366b5f9996 MRKCLettered        frag
list : 9046cd34-107e-2e48-9aae-4226b0e582ce MRKCLettered        frag
list : df490e35-0b0a-a94b-e788-0e3adba67a27 MRKCLettered        frag
list : ae2b2e9b-1ee5-2a46-a197-e2a3b671c468 MRKCLettered        frag
list : 12e7b881-04c2-fb4c-5a9f-51da1817343d MRKCLettered        frag
list : 1852d124-a409-0344-f3b1-831a3d2e38e7 MRKCLettered        frag
list : 68811f8d-0df5-bc42-70aa-1dd5fe8e74ec MRKCLettered        frag
list : bd536caa-54d4-424d-7cbf-3e810274044a MRKCLettered        frag
list : e1e56c90-d943-374e-f7a7-df6146cc558a MRKCLettered        frag
list : 0399e71b-68dc-1c4c-ab9d-248fdd3e6511 MRKCLettered        frag
list : 6df47184-0aa1-574d-ab8d-1632255610e1 MRKCLettered        frag
list : bdcb5bf3-9295-3846-fa9b-01c592dcdab2 MRKCLettered        frag
list : 8c0aa796-62b7-5a41-1cbe-b615ef83128f MRKCLettered        frag
list : e4514816-922b-ad45-20a8-72416bd5fa6f MRKCLettered        frag
list : 7ddd475e-37d2-8445-98a3-3e5f11f107d1 MRKCLettered        frag
list : 22 elementals listed in 0.007s
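
The elemental counts quoted for the two bundle sizes follow from a ceiling division of the number of statements by the bundle size. The short Python sketch below reproduces that arithmetic; the helper name bundle_count is hypothetical, and the calculation assumes that every bundle is filled completely before a new one is started.

```python
def bundle_count(statements, bundle_size):
    """Number of bundles needed to hold `statements` items (ceiling division)."""
    return -(-statements // bundle_size)

total = 61889  # statements imported from the genome file
for size in (1024, 3072):
    print(size, bundle_count(total, size))
```

With 61889 statements this gives 61 bundles of 1024 and 21 bundles of 3072; the listing above shows one more elemental than that because the import elemental is listed as well.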

Finally, we will save the statements into a fizz file, so that we can later reload it:

?- /save("./etc/articles/e.coli/frags.fizz",frag)
save : completed in 0.429s.

Since it’s a rather large file (5682972 bytes) for fizz to parse, don’t expect the loading time to be stellar:

$ ./fizz.x64 ./etc/articles/e.coli/frags.fizz
fizz 0.3.0-X (20180519.2228) [x64|8|w|l]
load : loading ./etc/articles/e.coli/frags.fizz ...
load : loaded  ./etc/articles/e.coli/frags.fizz in 7.244s
?- /list
list : dea95f02-9b8b-9e4f-54b2-ccc3dd755e51 MRKCLettered        frag
list : 5a05e004-2de2-6642-87b0-fd08a661daa8 MRKCLettered        frag
list : 0f7e1975-b14e-3449-6488-fcbc538600e0 MRKCLettered        frag
list : 065b74c2-00a8-d74d-72bb-19e2cba18c0c MRKCLettered        frag
list : 42b6e269-62e5-5b4f-7582-9659cad1cc48 MRKCLettered        frag
list : 91859291-b806-c849-3eb8-a83a39bffddb MRKCLettered        frag
list : 0a470c7c-3b9a-f142-b789-5a4c1d651c17 MRKCLettered        frag
list : 97b6a24d-fd51-4649-aead-e0ee7f6b1f14 MRKCLettered        frag
list : 2693353f-d98d-0b4d-1cad-ef003067efd3 MRKCLettered        frag
list : 0fe0fcb0-fd1f-6e45-3793-849650ac01be MRKCLettered        frag
list : ec6802b4-3a87-c94a-b193-bfbc32fdb1a2 MRKCLettered        frag
list : 48f6371a-0b1b-e24d-faa7-a3133d58c63a MRKCLettered        frag
list : c4fac853-80e3-2240-5a86-66da3ed658ca MRKCLettered        frag
list : a79b17d2-e1d3-de42-10a9-d084983388fa MRKCLettered        frag
list : 2766422f-f4ee-5e4a-b096-71ebb37e2455 MRKCLettered        frag
list : fc76a5aa-4379-6447-6a81-8c7542a95da2 MRKCLettered        frag
list : 38a4a6e0-db1b-d941-be9a-dd6de8bc2218 MRKCLettered        frag
list : 0bcff970-6738-cb4c-31bd-644e3b266be8 MRKCLettered        frag
list : 1edcd74b-0513-e84a-5690-d7e378b60aff MRKCLettered        frag
list : 1c9e1128-2796-a44e-2cae-9acf1e5fa8e9 MRKCLettered        frag
list : d329fe7f-c742-4a44-0ab4-846e3ee59d9d MRKCLettered        frag
list : 21 elementals listed in 0.002s
?- /stats
stats : e:26 k:22 s:61889 p:0 u:25.83 t:2 q:0 r:0 z:

As we are not actually transforming the statements we read from the text file, we can use a special mode of the import.txt command to bundle the statements over multiple elemental objects, as we did above. This has the advantage of a much better runtime performance, since no inference is executed for each statement.

This is done by adding a list of flags (symbols) as the last term of the call. Here we will ask the import.txt command to bundle the statements by 3072, and to spawn a new elemental for each bundle:

$ ./fizz.x64
fizz 0.3.0-X (20180519.2228) [x64|8|w|l]
?- /import.txt("./etc/data/U00096.3.txt",frag,1,3072,[loop,bundle,spawn])
import.txt : 1000 lines parsed ...
import.txt : 2000 lines parsed ...
...
import.txt : 61000 lines parsed ...
import.txt : 61889 lines read in 0.291s.
?- /stats
stats : e:26 k:22 s:61889 p:0 u:11.28 t:0 q:0 r:0 z:0
?- /list
list : 52383e12-0034-954e-d48c-3e0d15952362 MRKCLettered        frag
list : 7c00f988-3b0d-5249-9a83-14230680cfcb MRKCLettered        frag
list : 65e83f8a-17d1-4c40-9ca8-28f2339a6639 MRKCLettered        frag
list : 97498a7a-7f32-df4f-cdb6-06d866c05a8a MRKCLettered        frag
list : d94d31d9-cf81-cf44-94a8-c4b21743675b MRKCLettered        frag
list : 46063da9-44ea-2042-20b6-aaaf753e08c0 MRKCLettered        frag
list : b4520421-c09a-384c-349c-b49727869a01 MRKCLettered        frag
list : d7335353-a3bd-c14f-9cb1-c596c6d6aee0 MRKCLettered        frag
list : 2169255d-6855-884e-d59d-7808066ca3fb MRKCLettered        frag
list : 378798f5-8771-ac4a-4bb4-1d4bfadccfd4 MRKCLettered        frag
list : d4916964-bfd4-4d42-c8b9-6bea18c8fc1a MRKCLettered        frag
list : ab9982e2-27c1-5b4e-a7b3-b74f6b5bcc0f MRKCLettered        frag
list : 24100611-4edf-474e-4288-9c14fbbda099 MRKCLettered        frag
list : 75f4a47b-e770-3745-3784-83a906f32a58 MRKCLettered        frag
list : 56b4c9d1-50a9-3544-f6ac-3c498bf43515 MRKCLettered        frag
list : c53725de-bf65-ad46-8795-9c0dbd44e15b MRKCLettered        frag
list : f65b0bdc-4490-e049-3698-dbfdb616299c MRKCLettered        frag
list : bdeb4d21-e025-9543-8d8d-e5725331518a MRKCLettered        frag
list : 2cbb5aef-d545-194f-40a2-9cb47515d82e MRKCLettered        frag
list : 426917ff-d5a1-d544-db87-25868c0b202a MRKCLettered        frag
list : c42bf3d3-77ba-c648-31af-c88128dedcb3 MRKCLettered        frag
list : 21 elementals listed in 0.001s

Importing the gene descriptions

The second data set we are going to import into fizz is a collection of all the identified genes for the E. coli genome. The set is in a CSV formatted document (with tabulation as separator), so we will be using the command /import.csv. Unlike the DNA data, we are going to have to do some transformation on each of the statements that will be extracted, which means we won’t be able to make the import as fast as for the sequences. Fortunately, there are only 4504 lines to be processed in that file.

Each of the lines from the CSV file contains the following 13 elements, some of which we will be ignoring:

EG: EcoGene Accession Number.
ECK: K-12 Gene Accession Number.
Gene: Primary Gene Name.
Syn: Alternate Gene Symbols.
Type: Genotype.
Len: Sequence length.
Orientation: Orientation (Clockwise, Counterclockwise).
LeftEnd: Genomic Address, left end.
RightEnd: Genomic Address, right end.
Protein: Protein description.
Function: Known function.
Description: Description.
Comments: Comments.
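
To make the record layout concrete, here is a minimal Python sketch that reads tab-separated lines and labels each value with the field names listed above. It is an illustration outside of fizz: the function name read_genes is hypothetical, and the sample record is made up for the example rather than taken from the EcoGene file.

```python
import csv
import io

FIELDS = ["EG", "ECK", "Gene", "Syn", "Type", "Len", "Orientation",
          "LeftEnd", "RightEnd", "Protein", "Function", "Description",
          "Comments"]

def read_genes(stream):
    """Yield one dict per tab-separated line, keyed by the field names above."""
    for row in csv.reader(stream, delimiter="\t"):
        yield dict(zip(FIELDS, row))

# Made-up record for illustration only (not real EcoGene data).
record = ["EG00000", "ECK0000", "abcD", "None", "aa", "100", "Clockwise",
          "1", "100", "Null", "Null", "example", "none"]
sample = io.StringIO("\t".join(record) + "\n")

genes = list(read_genes(sample))
print(genes[0]["Gene"], genes[0]["Orientation"])
```

Every value is read as a string here, which matches the situation described for import.csv below: any conversion to another type has to be done explicitly afterwards.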

Let’s start by adding a new procedural knowledge definition to the import.fizz file we have already been using. It has a single prototype which, when triggered by any statement published by the import.csv command, will print the gene’s identifier to the console:

import.gene {

    () :-   @line.gene(:t1,:t2,:gene,:t4,:t5,:len,:t7,:leftend,:rightend,:t10,:t11,_,:comments),
            console.puts(:t1),
            hush;

}

We can then go ahead and test that by importing the first 10 lines from the file:

$ ./fizz.x64 ./etc/articles/e.coli/import.fizz
fizz 0.3.0-X (20180519.2228) [x64|8|w|l]
load : loading ./etc/articles/e.coli/import.fizz ...
load : loaded  ./etc/articles/e.coli/import.fizz in 0.007s
?- /import.csv("./etc/data/EcoData022718-235543.txt",line.gene,"\t",[],1,10)
import.csv : 10 lines read in 0.001s.
EG10001
EG10002
EG10003
EG10004
EG10005
EG10006
EG10007
EG10008
EG10009
EG10010

Because the import.csv command doesn’t automatically convert strings into symbols, we are going to have to do so for all the terms that we wish to be handled as symbols. For that we will use the primitive str.tosym to convert the following terms: EG, ECK, Type, and Orientation:

import.gene {

    () :-   @line.gene(:t1,:t2,:gene,:t4,:t5,:len,:t7,:leftend,:rightend,:t10,:t11,_,:comments),
            str.tosym(:t1,:eg),
            str.tosym(:t2,:eck),
            str.tosym(:t5,:type),
            str.tosym(:t7,:ori),
            console.puts(:t1),
            hush;

}

Ideally, we would have liked to also convert the gene identifier to a symbol (e.g. hisM). Unfortunately, some of them contain unsuitable characters (e.g. rhsE'), so we will leave them as strings. Next, we will use the primitive str.tokenize to transform the fourth term (Alternate Gene Symbols) from a string to a list of strings, as that field uses a comma to separate the symbols:

import.gene {

    () :-   @line.gene(:t1,:t2,:gene,:t4,:t5,:len,:t7,:leftend,:rightend,:t10,:t11,_,:comments),
            str.tosym(:t1,:eg),
            str.tosym(:t2,:eck),
            str.tosym(:t5,:type),
            str.tosym(:t7,:ori),
            str.tokenize(:t4,",",:alts),
            console.puts(:t1,": ",:alts),
            hush;

}

If we run the import again, then we can make two observations which will improve the representation of the data:

?- /import.csv("./etc/data/EcoData022718-235543.txt",line.gene,"\t",[],1,10)
import.csv : 10 lines read in 0.001s.
EG10001: ["None"]
EG10002: ["chlJ"]
EG10003: ["None"]
EG10004: ["coaBC"]
EG10005: ["fpr", " fruF"]
EG10006: ["genF"]
EG10007: ["None"]
EG10008: ["hydH"]
EG10009: ["icd"]
EG10010: ["None"]

The first is that when the string is "None", we should be using an empty list. The second is that we should trim each of the strings, since we can see instances where an extra space shows up (e.g. " fruF"). To that end, we’re going to add two procedural knowledge definitions to our import.fizz file to perform that transformation. The first one, which we will call clean.list, handles the trimming of any strings in a list. It works recursively (like most things in fizz), and uses the primitive str.trim for the actual trimming of the strings:

clean.list {

    ([],[])                 ^:- true;
    ([:e?[is.string]],[:f]) ^:- str.trim(:e,:f);
    ([:h|:t],[:h2|:t2])      :- #clean.list(:t,:t2), str.trim(:h,:h2);

}

As a reminder, the caret that you see on the first two prototypes indicates that if the entry point does unify, the solver shouldn’t try to use any of the following prototypes. Now, the second bit of procedural knowledge we are going to add is the one we will be directly calling in import.gene. It simply either matches ["None"] to an empty list, or calls clean.list:

import.gene.clean {

    (["None"],[])  ^:- true;
    (:l,:cl)        :- #clean.list(:l,:cl);

}

We can now add the transformation of the fourth term to import.gene:

import.gene {

    () :-   @line.gene(:t1,:t2,:gene,:t4,:t5,:len,:t7,:leftend,:rightend,:t10,:t11,_,:comments),
            str.tosym(:t1,:eg),
            str.tosym(:t2,:eck),
            str.tosym(:t5,:type),
            str.tosym(:t7,:ori),
            str.tokenize(:t4,",",:t4.1), #import.gene.clean(:t4.1,:syn),
            console.puts(:t1,": ",:syn),
            hush;

}

If we now perform the test import again, then we get a much better result:

?- /import.csv("./etc/data/EcoData022718-235543.txt",line.gene,"\t",[],1,10)
import.csv : 10 lines read in 0.001s.
EG10001: []
EG10003: []
EG10002: ["chlJ"]
EG10004: ["coaBC"]
EG10005: ["fpr", "fruF"]
EG10007: []
EG10006: ["genF"]
EG10008: ["hydH"]
EG10010: []
EG10009: ["icd"]

Now, we will perform the same transformation for the 10th and 11th terms, but before that we’re going to make a small addition to import.gene.clean to handle the fact that the string can have a value of "Null", which we will handle like we did for "None":

import.gene.clean {

    (["None"],[])  ^:- true;
    (["Null"],[])  ^:- true;
    (:l,:cl)        :- #clean.list(:l,:cl);

}

import.gene {

    () :-   @line.gene(:t1,:t2,:gene,:t4,:t5,:len,:t7,:leftend,:rightend,:t10,:t11,_,:comments),
            str.tosym(:t1,:eg),
            str.tosym(:t2,:eck),
            str.tosym(:t5,:type),
            str.tosym(:t7,:ori),
            str.tokenize(:t4,",",:t4.1), #import.gene.clean(:t4.1,:syn),
            str.tokenize(:t10,";",:t10.1), #import.gene.clean(:t10.1,:protein),
            str.tokenize(:t11,";",:t11.1), #import.gene.clean(:t11.1,:function),
            console.puts(:t1,": ",:function),
            hush;

}
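
For readers more familiar with a conventional language, the tokenize-then-clean step applied to the 4th, 10th and 11th terms can be mirrored in a few lines of Python. This is a sketch of the intended behaviour only: the function names are hypothetical, and it assumes that str.tokenize behaves like a plain split on the separator and that str.trim removes leading and trailing whitespace.

```python
def clean_list(items):
    """Trim surrounding whitespace from every string in the list."""
    return [item.strip() for item in items]

def gene_clean(items):
    """Map the placeholder lists ["None"] and ["Null"] to an empty list."""
    if items in (["None"], ["Null"]):
        return []
    return clean_list(items)

def tokenize_and_clean(text, separator):
    """Split text on separator, then clean the resulting list."""
    return gene_clean(text.split(separator))

print(tokenize_and_clean("fpr, fruF", ","))
print(tokenize_and_clean("None", ","))
```

The first call yields the two trimmed symbols and the second yields an empty list, which is the result the fizz definitions above aim for.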

Lastly, we just need to assert a statement for each gene we are importing. To speed things up we’re going to use the bundle primitive like we did earlier with a bundle size of 2048:

import.gene {

    () :-   @line.gene(:t1,:t2,:gene,:t4,:t5,:len,:t7,:leftend,:rightend,:t10,:t11,_,:comments),
            str.tosym(:t1,:eg),
            str.tosym(:t2,:eck),
            str.tosym(:t5,:type),
            str.tosym(:t7,:ori),
            str.tokenize(:t4,",",:t4.1), #import.gene.clean(:t4.1,:syn),
            str.tokenize(:t10,";",:t10.1), #import.gene.clean(:t10.1,:protein),
            str.tokenize(:t11,";",:t11.1), #import.gene.clean(:t11.1,:function),
            bundle(gene(:eg,:eck,:gene,:syn,:type,:len,:ori,:leftend,:rightend,:protein,:function,:comments),1,{},2048),
            hush;

}

We are now ready to import the whole content of the file. Note that depending on the runtime settings and your system’s performance, the number of elementals that will be spawned may be different, and they may not all contain exactly 2048 statements. We will then save the statements into a fizz file:

?- /import.csv("./etc/data/EcoData022718-235543.txt",line.gene,"\t",[],1)
import.csv : 1000 lines parsed ...
import.csv : 2000 lines parsed ...
import.csv : 3000 lines parsed ...
import.csv : 4000 lines parsed ...
import.csv : 4504 lines read in 0.273s.
?- /list
list : f5ef9cec-8462-1e49-b08d-70889857cc03 MRKCBFSolver        clean.list
list : 40946431-403d-af4b-ddb2-297f0d869d40 MRKCLettered        gene
list : f001366a-680a-c540-b0a4-c7582d6d078d MRKCLettered        gene
list : fc5cfd50-1270-a540-3a80-d677f8e50eb7 MRKCLettered        gene
list : a65b3e1e-11ab-d845-44a1-f580bc4a37d0 MRKCLettered        gene
list : a958a4b2-17c1-f542-1892-4c7a748109a1 MRKCBFSolver        import.gene
list : 16269115-186e-c34b-178d-3ee7bcef64e6 MRKCBFSolver        import.gene.clean
list : a519801f-0745-e143-ab98-6cae94920398 MRKCBFSolver        import.frag
list : 8 elementals listed in 0.000s
?- /stats
stats : e:13 k:9 s:4504 p:8 u:23.02 t:18 q:37875 r:37875 z:4504
?- /save("./etc/articles/e.coli/genes.fizz",gene)
save : completed in 0.114s.

Optimizing loading time

Now that we have transformed both raw data files into a factual knowledge representation that fizz can easily manipulate, we are going to look at how to speed up the loading time. Let’s first assess the loading time by loading both files when starting fizz:

$ ./fizz.x64 ./etc/articles/e.coli/genes.fizz ./etc/articles/e.coli/frags.fizz
fizz 0.3.0-X (20180519.2228) [x64|8|w|l]
load : loading ./etc/articles/e.coli/genes.fizz ...
load : loading ./etc/articles/e.coli/frags.fizz ...
load : loaded  ./etc/articles/e.coli/genes.fizz in 2.238s
load : loaded  ./etc/articles/e.coli/frags.fizz in 5.493s
load : loading completed in 5.493s
?- /stats
stats : e:30 k:26 s:66393 p:0 u:33.16 t:9 q:0 r:0 z:

To speed up the loading time we are going to take advantage of a native feature of fizz: the runtime will try to load each specified file concurrently. This means we need to split any large file into multiple files which can then be loaded concurrently (fizz is set up to use up to half of the cores it is enabled on to parallelize the loading).
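
The log above is consistent with concurrent loading: the reported total (5.493s) equals the time taken by the slower of the two files and not the sum of both. The Python sketch below spells out that comparison using the two timings from the log. It is a deliberately simple model which assumes enough free cores and ignores any contention between the loaders.

```python
# Per-file load times reported in the log above (seconds).
load_times = {"genes.fizz": 2.238, "frags.fizz": 5.493}

sequential = sum(load_times.values())   # one file after the other
concurrent = max(load_times.values())   # both at once, given enough free cores

print(f"sequential: {sequential:.3f}s, concurrent: {concurrent:.3f}s")
```

Under this model the overall time is bounded by the largest file, which is why splitting frags.fizz into several smaller files is the natural next step.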

We are going to start with the frags.fizz file, since it contains over 60000 statement definitions spread over 21 factual knowledge elementals. Using the primitive fzz.lst (which can only be called as an offload), we are going to obtain the GUIDs of all these elementals and group them, so that we can then save each group into a separate file. To do that, we first need to add to import.fizz (which we have been working on for a while now) some new procedural knowledge that will allow us to break a large list into smaller sub-lists.

For that, we first declare a procedural knowledge which we will call lst.split which splits a list into two based on an arbitrary number of elements to be included in the first list:

lst.split {

    ([],_,[],[])               ^:- true;
    ([:e],_,[:e],[])           ^:- true;
    ([:h|:r],1,[:h],:r)        ^:- true;
    ([:h|:r],:c,[:h|:l1],:l2)   :- sub(:c,1,:c1), #lst.split(:r,:c1,:l1,:l2);

}

We will then use it in another procedural knowledge (which we call lst.break) as follows:

lst.break {

    ([],_,[])           ^:- true;
    ([:e],_,[[:e]])     ^:- true;
    (:l,:n,[:l2|:l3])    :- #lst.split(:l,:n,:l2,:r), #lst.break(:r,:n,:l3);

}

We’ll quickly check that they both work as expected:

?- #lst.break([a,b,c,d,e,f],2,:l)
-> ( [[a, b], [c, d], [e, f]] ) := 1.00 (0.011) 1
?- #lst.break([a,b,c,d,e,f,g],2,:l)
-> ( [[a, b], [c, d], [e, f], [g]] ) := 1.00 (0.012) 1
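
The same two operations can be written in Python to make the intended behaviour easy to check. This is an illustrative equivalent rather than a translation of the fizz prototypes: it uses slicing and a loop instead of recursion, and the function names lst_split and lst_break are simply borrowed from the definitions above.

```python
def lst_split(items, count):
    """Split items into (first `count` elements, remaining elements)."""
    return items[:count], items[count:]

def lst_break(items, count):
    """Break items into consecutive sub-lists of at most `count` elements."""
    if count < 1:
        raise ValueError("count must be at least 1")
    groups = []
    while items:
        head, items = lst_split(items, count)
        groups.append(head)
    return groups

print(lst_break(["a", "b", "c", "d", "e", "f"], 2))
print(lst_break(["a", "b", "c", "d", "e", "f", "g"], 2))
```

As in the two fizz queries above, a list whose length is not a multiple of the group size ends with a shorter final sub-list.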

By combining lst.break and fzz.lst, we are able to split all the elementals (using their GUIDs) into sub-lists:

?- &fzz.lst(frag,:l), #lst.break(:l,3,:l2)
-> ( ["93c52b11-d765-9741-89b3-6869f6eac854", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "995f1eeb-5795-5b4d-2bb2-dee45381126b", "3a3c92c2-3995-6f4e-7a83-92b33018f682", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "3328c60c-a842-8945-63be-ab0762eeb262", "dd0503c8-ac9f-1f4a-a086-6541f62d5781", "51156d15-40a3-9349-7d85-c83852f2f1eb", "62952974-2a70-b647-1ca5-346c4ff67679", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf", "7332b836-ec54-ea45-4aac-90fadea47203", "9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "6b0b3dd4-76a2-2746-1f89-af9245803418", "30331e58-ae61-ff45-7496-65c29acef854", "5c0c524b-cd92-9c42-23a3-01955a487c84"] , [["93c52b11-d765-9741-89b3-6869f6eac854", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2"], ["f89e5abc-40e4-b140-f190-e2ad51c274b5", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"], ["e9b1603e-32f6-6a48-0c99-2e38a0d6362e", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "995f1eeb-5795-5b4d-2bb2-dee45381126b"], ["3a3c92c2-3995-6f4e-7a83-92b33018f682", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "3328c60c-a842-8945-63be-ab0762eeb262"], ["dd0503c8-ac9f-1f4a-a086-6541f62d5781", "51156d15-40a3-9349-7d85-c83852f2f1eb", "62952974-2a70-b647-1ca5-346c4ff67679"], ["8c5e42b1-21dc-8e47-4ba3-480f22cb26cf", "7332b836-ec54-ea45-4aac-90fadea47203", "9dc15b29-4cd3-b144-8493-5c3ca70c1cee"], ["6b0b3dd4-76a2-2746-1f89-af9245803418", "30331e58-ae61-ff45-7496-65c29acef854", "5c0c524b-cd92-9c42-23a3-01955a487c84"]] ) := 1.00 (0.018) 1

From there, we will get every sub-list, make up an appropriate filename for it, and use the console.exec primitive to execute the save command which, when given a list of GUIDs, will save the identified elementals into the same file:

?- &fzz.lst(frag,:l), #lst.break(:l,3,:l2), lst.item(:l2,:i,:e), str.cat("./etc/articles/e.coli/frag-",:i,".fizz",:n), console.exec(save(:n,:e))
-> ( ["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2", "3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679", "dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf", "7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682", "6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e", "51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854", "995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"] , [["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2"], ["3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679"], ["dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf"], ["7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682"], ["6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e"], ["51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854"], ["995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"]] , 0 , ["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2"] , "./etc/articles/e.coli/frag-0.fizz" ) := 1.00 (0.018) 1
-> ( ["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2", "3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679", "dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf", "7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682", "6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e", "51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854", "995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"] , [["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2"], ["3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679"], ["dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf"], ["7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682"], ["6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e"], ["51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854"], ["995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"]] , 1 , ["3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679"] , "./etc/articles/e.coli/frag-1.fizz" ) := 1.00 (0.018) 2
-> ( ["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2", "3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679", "dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf", "7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682", "6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e", "51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854", "995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"] , [["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2"], ["3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679"], ["dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf"], ["7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682"], ["6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e"], ["51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854"], ["995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"]] , 2 , ["dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf"] , "./etc/articles/e.coli/frag-2.fizz" ) := 1.00 (0.018) 3
-> ( ["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2", "3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679", "dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf", "7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682", "6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e", "51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854", "995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"] , [["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2"], ["3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679"], ["dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf"], ["7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682"], ["6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e"], ["51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854"], ["995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"]] , 3 , ["7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682"] , "./etc/articles/e.coli/frag-3.fizz" ) := 1.00 (0.018) 4
-> ( ["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2", "3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679", "dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf", "7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682", "6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e", "51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854", "995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"] , [["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2"], ["3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679"], ["dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf"], ["7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682"], ["6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e"], ["51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854"], ["995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"]] , 4 , ["6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e"] , "./etc/articles/e.coli/frag-4.fizz" ) := 1.00 (0.018) 5
-> ( ["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2", "3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679", "dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf", "7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682", "6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e", "51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854", "995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"] , [["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2"], ["3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679"], ["dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf"], ["7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682"], ["6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e"], ["51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854"], ["995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"]] , 5 , ["51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854"] , "./etc/articles/e.coli/frag-5.fizz" ) := 1.00 (0.019) 6
-> ( ["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2", "3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679", "dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf", "7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682", "6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e", "51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854", "995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"] , [["9dc15b29-4cd3-b144-8493-5c3ca70c1cee", "30331e58-ae61-ff45-7496-65c29acef854", "eedf5b25-45b4-6c47-ff88-04a934f6ebc2"], ["3328c60c-a842-8945-63be-ab0762eeb262", "6b14f85e-aad1-d148-efb6-58f4f89ecf93", "62952974-2a70-b647-1ca5-346c4ff67679"], ["dd0503c8-ac9f-1f4a-a086-6541f62d5781", "5c0c524b-cd92-9c42-23a3-01955a487c84", "8c5e42b1-21dc-8e47-4ba3-480f22cb26cf"], ["7332b836-ec54-ea45-4aac-90fadea47203", "2b285f0d-ee95-8f47-e3a3-69eca43ef8a9", "3a3c92c2-3995-6f4e-7a83-92b33018f682"], ["6b0b3dd4-76a2-2746-1f89-af9245803418", "23656e64-c0d0-aa48-22af-d5a3730fe5d7", "e9b1603e-32f6-6a48-0c99-2e38a0d6362e"], ["51156d15-40a3-9349-7d85-c83852f2f1eb", "f89e5abc-40e4-b140-f190-e2ad51c274b5", "93c52b11-d765-9741-89b3-6869f6eac854"], ["995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"]] , 6 , ["995f1eeb-5795-5b4d-2bb2-dee45381126b", "eec06c04-2081-db4c-ed90-a84636a8a6cb", "050d91ad-d2eb-cf4b-94bd-5b7da5d64526"] , "./etc/articles/e.coli/frag-6.fizz" ) := 1.00 (0.019) 7
save : completed in 0.051s.
save : completed in 0.067s.
save : completed in 0.083s.
save : completed in 0.060s.
save : completed in 0.134s.
save : completed in 0.050s.
save : completed in 0.091s.

We can now restart fizz and this time use the seven files containing the DNA fragments instead of the single file:

$ ./fizz.x64 ./etc/articles/e.coli/genes.fizz ./etc/articles/e.coli/frag-* ./etc/articles/e.coli/import.fizz
fizz 0.3.0-X (20180519.2228) [x64|8|w|l]
load : loading ./etc/articles/e.coli/genes.fizz ...
load : loading ./etc/articles/e.coli/frag-1.fizz ...
load : loading ./etc/articles/e.coli/frag-2.fizz ...
load : loading ./etc/articles/e.coli/frag-0.fizz ...
load : loaded  ./etc/articles/e.coli/frag-2.fizz in 0.969s
load : loaded  ./etc/articles/e.coli/frag-0.fizz in 0.980s
load : loaded  ./etc/articles/e.coli/frag-1.fizz in 1.046s
load : loading ./etc/articles/e.coli/frag-3.fizz ...
load : loading ./etc/articles/e.coli/frag-4.fizz ...
load : loading ./etc/articles/e.coli/frag-5.fizz ...
load : loaded  ./etc/articles/e.coli/frag-3.fizz in 1.043s
load : loaded  ./etc/articles/e.coli/frag-4.fizz in 1.057s
load : loaded  ./etc/articles/e.coli/frag-5.fizz in 0.986s
load : loading ./etc/articles/e.coli/frag-6.fizz ...
load : loading ./etc/articles/e.coli/import.fizz ...
load : loaded  ./etc/articles/e.coli/import.fizz in 0.031s
load : loaded  ./etc/articles/e.coli/genes.fizz in 2.656s
load : loaded  ./etc/articles/e.coli/frag-6.fizz in 0.730s
load : loading completed in 2.948s
?- /stats
stats : e:36 k:32 s:66393 p:15 u:57.54 t:0 q:0 r:0 z:

We now have a much better loading time.

Optimizing retrieval time

Once all the DNA fragments are loaded, let’s see what sort of performance we can get when retrieving a particular fragment based on its identifier:

?- #frag(4528,:s)
-> ( "GCAGCGGCATTCTGCCGGTGATCAACACCGCCATCGCCCATAAAGATGCGGGCGTCGGCATGATTGGCGCGGGCA" ) := 1.00 (0.135) 1
?- #frag(60000,:s)
-> ( "ACAGGCAATTTTTCGGGGATACTGCTCCAGGTAATTATTCGGCTAGGAGTTAAGGCTGTCACACGGATTTGGATG" ) := 1.00 (0.118) 1
?- #frag(1,:s)
-> ( "AACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATA" ) := 1.00 (0.118) 1

As we have spread all the DNA fragments over 21 elementals, a query for a specific fragment is sent concurrently to all of them. While this is faster than keeping all the sequences in a single elemental, there is one simple thing we can do to improve performance: index the statements in each elemental on their first term. This allows faster retrieval of statements whenever an indexed term is bound to a value in a query.

To do that, we are going to instruct each of the elementals to set up an index using the poke console command:

?- /poke(frag,index,0)

This sets the index property of all the elemental objects labeled frag to 0, the position of the term to be used as the indexing key in the statements. We can now check whether this has improved the query performance:

?- #frag(4528,:s)
-> ( "GCAGCGGCATTCTGCCGGTGATCAACACCGCCATCGCCCATAAAGATGCGGGCGTCGGCATGATTGGCGCGGGCA" ) := 1.00 (0.001) 1
?- #frag(60000,:s)
-> ( "ACAGGCAATTTTTCGGGGATACTGCTCCAGGTAATTATTCGGCTAGGAGTTAAGGCTGTCACACGGATTTGGATG" ) := 1.00 (0.001) 1
?- #frag(1,:s)
-> ( "AACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATA" ) := 1.00 (0.001) 1

That’s much better: each query now completes in about 0.001s, down from 0.118s to 0.135s. Note that once an index has been created, the elemental will maintain it when statements are added or removed, so you do not have to poke at it after changes. Multiple indexes are also supported (using a list of indexes).
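
To see why an index helps, here is a minimal Python sketch (not fizz code, and not a description of how fizz implements its index internally) contrasting a linear scan with a dictionary lookup keyed on the first term. The class `FragStore` and its methods are hypothetical names used only for this illustration, and the short strings are placeholders rather than real fragments.

```python
from collections import defaultdict


class FragStore:
    """Toy statement store indexed on the term at position `index_pos`."""

    def __init__(self, index_pos=0):
        self.index_pos = index_pos
        self.statements = []             # every statement, in insertion order
        self.index = defaultdict(list)   # key term -> statements holding it

    def add(self, statement):
        # The index is updated on every insertion, so it never goes stale.
        self.statements.append(statement)
        self.index[statement[self.index_pos]].append(statement)

    def remove(self, statement):
        # Removal keeps the index in sync as well.
        self.statements.remove(statement)
        key = statement[self.index_pos]
        self.index[key].remove(statement)
        if not self.index[key]:
            del self.index[key]

    def find_scan(self, key):
        # Without an index: inspect every statement.
        return [s for s in self.statements if s[self.index_pos] == key]

    def find_indexed(self, key):
        # With an index: a single dictionary lookup.
        return list(self.index.get(key, []))


store = FragStore(index_pos=0)
store.add((0, "AGCT"))   # placeholder data, not real fragments
store.add((1, "AACT"))
store.add((2, "GGAT"))
store.remove((2, "GGAT"))
```

Both lookups return the same statements; the difference is that the scan touches every statement while the indexed lookup goes straight to the matching key, which is consistent with the drop in query time observed above.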

In order to avoid having to poke each time we load the DNA sequences, we are going to once again save all the frag elementals to file. We will use a different filename so as to keep the non-indexed version:

?- &fzz.lst(frag,:l), #lst.break(:l,3,:l2), lst.item(:l2,:i,:e), str.cat("./etc/articles/e.coli/frag-i-",:i,".fizz",:n), console.exec(save(:n,:e))
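
Judging from the earlier console output, where the 21 elementals ended up in seven groups of three, this query can be read as: list the frag elementals, break the list into groups of three, and save each group under a numbered file name. The Python sketch below mirrors that grouping and naming logic only; `break_list` and `save_plan` are hypothetical helpers, and the identifiers are placeholders rather than real GUIDs.

```python
def break_list(items, size):
    """Split `items` into consecutive sub-lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def save_plan(guids, size=3, prefix="./etc/articles/e.coli/frag-i-"):
    """Return (filename, group) pairs, one per group of elementals."""
    groups = break_list(guids, size)
    return [(f"{prefix}{i}.fizz", group) for i, group in enumerate(groups)]


# 21 placeholder identifiers standing in for the elementals' GUIDs.
guids = [f"guid-{n}" for n in range(21)]
plan = save_plan(guids)
for filename, group in plan:
    print(filename, group)
```

The sketch only computes which elementals would go into which file; the actual saving is done by the runtime through console.exec.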

Alternatively, we could have edited each of the fizz files with a text editor and manually added the index property to each of the knowledge definitions, like this:

frag {index = 0} {

    (24576, "GAAACTGGCACGTCTTATCGAAATCAAAGCCAGCCGCGATGGTCGAGTGGCAGATTACGCCAAAGAATTTGGTCT");
    (24577, "GGTCTATCTCGAAGGCCAACAGCCGTGGTCTCTACCGGTTGATATCGCCCTGCCTTGCGCCACCCAGAATGAACT");
    (24578, "GGATGTTGACGCCGCGCATCAGCTTATCGCTAATGGCGTTAAAGCCGTCGCCGAAGGGGCAAATATGCCGACCAC");
    (24579, "CATCGAAGCGACTGAACTGTTCCAGCAGGCAGGCGTACTATTTGCACCGGGTAAAGCGGCTAATGCTGGTGGCGT");
    ...
    ...
}

We will complete this section by indexing the gene data on the gene’s name (the third term in each statement):

?- #gene(_,_,"feaR",_,_,_,:orient,:s,:e,_,_,_)
-> ( Counterclockwise , 1446378 , 1447283 ) := 1.00 (0.163) 1
?- /poke(gene,index,2)
?- #gene(_,_,"feaR",_,_,_,:orient,:s,:e,_,_,_)
-> ( Counterclockwise , 1446378 , 1447283 ) := 1.00 (0.001) 1
?- /save("./etc/articles/e.coli/genes-i.fizz",gene)
save : completed in 0.151s.

Finally, we also saved the newly indexed gene elementals into a different fizz file (genes-i.fizz).

Finding the famous GATTACA in the genes

To conclude this article, let’s look at something a little more fun: finding all the genes whose DNA contains at least one occurrence of the famous GATTACA string (a nod to the 1997 sci-fi movie Gattaca). To that end, we are going to write some procedural knowledge which, when given a gene’s name, will retrieve the complete DNA sequence and check whether it contains a given substring.

To start, create a new fizz file called base.fizz. We will first write procedural knowledge that, given a gene’s name, retrieves the offset and length (in base pairs) of its DNA sequence, as well as the orientation of the sequence (as we will see later, it matters!):

1  gene.offset {
2
3      (:name,:offset,:length,:orient) :-  #gene(_,_,:name,_,_,_,:orient,:s,:e,_,_,_),
4                                          sub(:e,:s,:l2),
5                                          add(:l2,1,:length),
6                                          sub(:s,1,:offset);
7
8  }

The first thing we do (line 3) is query the gene factual knowledge for any statement matching the gene’s name. Since we only care about a few of the terms, we use the wildcard variable for the others. Once we have the start and end base pairs, we compute the length of the sequence (lines 4 and 5) by subtracting the start offset from the end offset and adding 1 to the result. The result is unified with the variable length, which we will return. We end (line 6) by subtracting 1 from the start offset, since our base-pair offsets start at 0 instead of 1.
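
The same arithmetic can be sketched in Python (an illustration only, not fizz code). The function name `gene_offset` is hypothetical; the start and end values are those of the feaR gene shown earlier.

```python
def gene_offset(start, end):
    """Convert a 1-based inclusive [start, end] range into (0-based offset, length)."""
    length = (end - start) + 1   # both ends are included, hence the +1
    offset = start - 1           # shift from 1-based to 0-based
    return offset, length


print(gene_offset(1446378, 1447283))  # (1446377, 906)
```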

Let’s give this a try:

$ ./fizz.x64 ./etc/articles/e.coli/genes-i.fizz ./etc/articles/e.coli/frag-i-*.fizz ./etc/articles/e.coli/base.fizz
fizz 0.3.0-X (20180519.2228) [x64|8|w|l]
load : loading ./etc/articles/e.coli/genes-i.fizz ...
load : loading ./etc/articles/e.coli/frag-i-1.fizz ...
load : loading ./etc/articles/e.coli/frag-i-0.fizz ...
load : loading ./etc/articles/e.coli/frag-i-2.fizz ...
load : loaded  ./etc/articles/e.coli/frag-i-0.fizz in 1.401s
load : loaded  ./etc/articles/e.coli/frag-i-2.fizz in 1.419s
load : loading ./etc/articles/e.coli/frag-i-3.fizz ...
load : loading ./etc/articles/e.coli/frag-i-4.fizz ...
load : loaded  ./etc/articles/e.coli/frag-i-1.fizz in 1.637s
load : loading ./etc/articles/e.coli/frag-i-5.fizz ...
load : loaded  ./etc/articles/e.coli/frag-i-5.fizz in 1.163s
load : loading ./etc/articles/e.coli/frag-i-6.fizz ...
load : loaded  ./etc/articles/e.coli/frag-i-4.fizz in 1.428s
load : loaded  ./etc/articles/e.coli/frag-i-3.fizz in 1.552s
load : loading ./etc/articles/e.coli/base.fizz ...
load : loaded  ./etc/articles/e.coli/base.fizz in 0.020s
load : loaded  ./etc/articles/e.coli/genes-i.fizz in 3.915s
load : loaded  ./etc/articles/e.coli/frag-i-6.fizz in 1.368s
load : loading completed in 4.373s
?- #gene.offset("feaR",:o,:l,:or)
-> ( 1446377 , 906 , Counterclockwise ) := 1.00 (0.001) 1

Next, we are going to assemble the complete gene sequence from an offset and a length, which is a little trickier. First, we need to find out which of the 60000+ fragments contains the start of the gene (based on the starting offset we retrieved earlier). The following procedural knowledge implements this:

frag.offset.to.id {

    (:offset,:id,:off) :- div.int(:offset,75,:id), mod(:offset,75,:off);

}

It relies on the fact that each of the fragments (except the last one) contains 75 characters (base pairs) to compute the ID of the first fragment (using div.int) as well as the actual offset within that fragment (using the mod primitive). If we combine that new procedural knowledge with the frag one, we can retrieve the very first fragment, from which we will still have to extract the relevant part:

?- #frag.offset.to.id(1446377,:id,:o), #frag(:id,:s)
-> ( 19285 , 2 , "AGTTAGCGGAATTTACGTCGATACTCGCCTGGCGTCATCCCAAAGCGTTGCTTAAATACCGTTGAAAAATGACTC" ) := 1.00 (0.001) 1
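
As a cross-check, the same integer division and remainder can be written in Python (an illustration only, not fizz code). The names `FRAGMENT_LENGTH` and `offset_to_fragment` are hypothetical; the value 75 is the fragment length stated above.

```python
FRAGMENT_LENGTH = 75  # base pairs per fragment (all but the last one)


def offset_to_fragment(offset, fragment_length=FRAGMENT_LENGTH):
    """Return (fragment id, offset inside that fragment) for a 0-based offset."""
    frag_id, inner_offset = divmod(offset, fragment_length)
    return frag_id, inner_offset


print(offset_to_fragment(1446377))  # (19285, 2)
```

Since 19285 × 75 = 1446375, the gene starts 2 base pairs into fragment 19285, which agrees with the query result.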

Once we have the starting sequence, we are going to write the procedural knowledge that will assemble a complete sequence given the ID of the first fragment, the offset in that fragment and the total length of the sequence:

1frag.get {
2
3    (:id,_,0,[])                ^:- true;
4