An extraction set explosion is said to occur when the the rules and/or driver rows are such that the only referentially correct subset of the full size database is all or nearly all of the rows in that database. In other words, the only way the rules and initial conditions can extract a referentially correct subset is to extract nearly all of the existing data. The word "Explosion" in the name simply refers to the ever increasing quantity of extracted rows - not to any physical damage.
It is important to note that an extraction set explosion is not a bug in the sense that the DataBee software is doing something wrong. The software is doing exactly what is necessary to extract a referentially correct set of rows. The problem lies with the definition of the extraction rules - the relationships they define are causing too much data to be extracted and it is the rules that need to be adjusted.
How do I tell if an Extraction Set Explosion is Occurring?
It is pretty easy to detect. As you watch the Set Extractor work, you will see the row counts continue to climb. If you see them climb to extraordinary high numbers the extraction set is clearly in trouble. How high is too high? Well like many things associated with creating database subsets, it is not possible to make definitive statements. However, once the extracted row count seems to exceed that which is reasonable a closer look is recommended. For example, if your driver rows consist of 10% of a particular table you might reasonably expect to extract something like 10% of the rows from most tables. If most tables are having over 90% of their rows extracted then something is clearly wrong.
Another way to tell is to look at the extracted row count in your driver table. It is not uncommon to see more rows extracted for the driver table than were originally requested. However if the extracted row count starts to continuously climb then you probably have an extraction set explosion. For example, say you are attempting to extract a subset based on 100 customers from a 250,000 row customer table. You start the Set Extractor running and as you watch the extracted row count for the customer table it will initially start at 100. These are the driver rows pulled by the Where Clause rule. Now some of the customers might require other customer records in order to maintain referential correctness. If so, you might see the extracted rows for the customer table climb to 150 or even 1000. However if, as you watch the Set Extractor execute, you see the extracted rows for the customer table climb to values like 100,000 or more then clearly the rules are looping back on themselves and pulling much more data than is desired. This is an extraction set explosion and the rules will need to be adjusted in order to produce a viable subset.
How do I Stop Extraction Set Explosions From Happening?
There are a number of causes of extraction set explosions. The following list contains the some common sources.
For example, lets assume we have a database with a large customer table and all customers belong to one of two groups: business and consumer. The business customers, although representing a large part of the sales, consist of only 1% of the customer records. If the driver condition for the subset database specifies that the subset should consist of all consumer customers then when the extract takes place 99% of the data in the source will be extracted. This looks like an extraction set explosion since but in reality it is just the result of an unfortunate choice of initial driver rows.
If the self referential relationship is a foreign key then there is not much that can be done about the resulting extraction set explosion other than carefully choosing the driver data so that very few of the relationship chains are chosen. Note that disabling the Table-To-Table rule enforcing the foreign key will stop the extraction set explosion. But that foreign key will not enable in the resulting target subset database.
The same feedback effect can sometimes be introduced into an extraction set through the manual addition of rules. This can happen directly where a single rule requires rows from the same table to support the rows already extracted for that table. More commonly, the feedback relationship is through the involvement of one or more other tables. For example, a table A requires rows to be extracted from the table B in order to maintain referential integrity. A manual Table-To-Table rule is added which requires rows to be extracted from Table A for each row in table B. If the data relationship is such that the extracted rows in table B require new, previously unextracted, rows from table A then it is possible an extraction set explosion will develop as the new rows extracted for table A require yet more rows to be extracted from table B which then require even more rows from table A ... and so on.
It is possible to have multiple tables involved in the feedback cycle. Extraction set explosions generated through the interactions of multiple tables can be quite difficult to diagnose.
So, as a simplistic example, lets assume a subset of an INVOICE table was required. Each line in the invoice table has a TAX_ID column which indicates the sales taxes applicable to that item. Now, there are not many TAX_ID's and the permitted values are referenced in a table called TAX_CODES. To make sure that nobody can enter in unauthorized tax codes there is a foreign key relationship between the TAX_CODES table and the TAX_ID column in the invoice table. DataBee treats this foreign key as a rule so for each row extracted in the INVOICE table it will make sure the supporting row from the TAX_CODE table is extracted. Mostly these rows will be duplicates (which DataBee sifts out) since there are many rows extracted in the INVOICE table and they all use the same TAX_ID values. Now, if a poorly designed rule was manually placed on the TAX_CODE table which required that for every extracted TAX_ID row, the rows in the INVOICE table with that TAX_ID were required then an extraction set explosion would occur. The TAX_CODE table has few values and the TAX_ID field in INVOICE table is low cardinality. For example if a record with a TAX_ID of 15% was extracted, then the matching record in the TAX_CODE table would have to be extracted. So far everything is ok. However, the second poorly designed rule joining the low cardinality columns will activate. It will then require that all rows with a TAX_ID of 15% in the INVOICE table must be extracted and an extraction set explosion will result.
Note that it is rare to see a set of extraction rules derived entirely from foreign keys cause this sort of problem. It is almost always the manual addition of poorly conceived rules which cause the problem.
The problem with using All Rows rules in such situations is that in extraction sets is that it is often not clear which tables are simple reference tables which can be extracted in full without complications and which ones have potential "knock on effects" when extracted in full. For example, if a table has rules applied to it which require that for every row extracted for that table then other supporting rows from other tables must also be extracted then extracting all rows from that table with an All Rows rule will probably also extract all rows from the dependent tables. A tool called the Safe All Rows Rule Generator tool exists in the Set Designer which enables the creation of All Rows rules which are inherently "safe" - in other words the addition of All Rows rules to those tables will cause no extra rows to be extracted from other tables. It is always a much better to add All Rows rules with the Safe All Rows Rule Generator tool than to add them manually.