[Net 2000 Ltd. Home][DataBee Home][DataBee Manual][DataBee FAQ]

Extraction Set Explosions

What is an Extraction Set Explosion?

An extraction set explosion is said to occur when the the rules and/or driver rows are such that the only referentially correct subset of the full size database is all or nearly all of the rows in that database. In other words, the only way the rules and initial conditions can extract a referentially correct subset is to extract nearly all of the existing data. The word "Explosion" in the name simply refers to the ever increasing quantity of extracted rows - not to any physical damage.

It is important to note that an extraction set explosion is not a bug in the sense that the DataBee software is doing something wrong. The software is doing exactly what is necessary to extract a referentially correct set of rows. The problem lies with the definition of the extraction rules - the relationships they define are causing too much data to be extracted and it is the rules that need to be adjusted.

How do I tell if an Extraction Set Explosion is Occurring?

It is pretty easy to detect. As you watch the Set Extractor work, you will see the row counts continue to climb. If you see them climb to extraordinary high numbers the extraction set is clearly in trouble. How high is too high? Well like many things associated with creating database subsets, it is not possible to make definitive statements. However, once the extracted row count seems to exceed that which is reasonable a closer look is recommended. For example, if your driver rows consist of 10% of a particular table you might reasonably expect to extract something like 10% of the rows from most tables. If most tables are having over 90% of their rows extracted then something is clearly wrong.

Another way to tell is to look at the extracted row count in your driver table. It is not uncommon to see more rows extracted for the driver table than were originally requested. However if the extracted row count starts to continuously climb then you probably have an extraction set explosion. For example, say you are attempting to extract a subset based on 100 customers from a 250,000 row customer table. You start the Set Extractor running and as you watch the extracted row count for the customer table it will initially start at 100. These are the driver rows pulled by the Where Clause rule. Now some of the customers might require other customer records in order to maintain referential correctness. If so, you might see the extracted rows for the customer table climb to 150 or even 1000. However if, as you watch the Set Extractor execute, you see the extracted rows for the customer table climb to values like 100,000 or more then clearly the rules are looping back on themselves and pulling much more data than is desired. This is an extraction set explosion and the rules will need to be adjusted in order to produce a viable subset.

How do I Stop Extraction Set Explosions From Happening?

  1. Follow the recommended methodology for adding rules when building the extraction set. In other words, do not add multiple rules and hope for the best. The best way is to add a rule, run the set, note the effect on the extracted row count and then add another rule. If the newly added rule causes an extraction set explosion then disable or delete it.

  2. Read the explanations below regarding the causes of extraction set explosions and when you are adding rules pay attention to the way the relationships function. In other words, knowing your data is one of the best ways to avoid extraction set explosions.

  3. Take the time to figure out if a Table-To-Table rule you are adding is the best way to extract rows for a table. Some rules are better than others and some have very unfortunate effects.

How do Extraction Set Explosions Happen?

There are a number of causes of extraction set explosions. The following list contains the some common sources.

  1. The driver rows are too general. This cause of extraction set explosions is pretty rare since most people put quite a bit of thought into the choice of driver rows. However it can happen, and so will get a brief mention here.

    For example, lets assume we have a database with a large customer table and all customers belong to one of two groups: business and consumer. The business customers, although representing a large part of the sales, consist of only 1% of the customer records. If the driver condition for the subset database specifies that the subset should consist of all consumer customers then when the extract takes place 99% of the data in the source will be extracted. This looks like an extraction set explosion since but in reality it is just the result of an unfortunate choice of initial driver rows.

  2. A direct or indirect loopback effect causes rows extracted in a table to require the extraction of rows in the same table. Sometimes the data items in a table have a relationship chain with themselves. This is most commonly seen with a self-referential foreign key. In such cases, the child data item requires the parent data item which then requires further data items and so a single extracted row in a table can require the presence of many others in order to preserve referential correctness. Self-referential foreign keys are no problem for the DataBee software which adjusts its extraction algorythm to ensure that the required data is extracted. However, this chaining effect can cause problems as it leads to the extraction of far more data than was expected.

    If the self referential relationship is a foreign key then there is not much that can be done about the resulting extraction set explosion other than carefully choosing the driver data so that very few of the relationship chains are chosen. Note that disabling the Table-To-Table rule enforcing the foreign key will stop the extraction set explosion. But that foreign key will not enable in the resulting target subset database.

    The same feedback effect can sometimes be introduced into an extraction set through the manual addition of rules. This can happen directly where a single rule requires rows from the same table to support the rows already extracted for that table. More commonly, the feedback relationship is through the involvement of one or more other tables. For example, a table A requires rows to be extracted from the table B in order to maintain referential integrity. A manual Table-To-Table rule is added which requires rows to be extracted from Table A for each row in table B. If the data relationship is such that the extracted rows in table B require new, previously unextracted, rows from table A then it is possible an extraction set explosion will develop as the new rows extracted for table A require yet more rows to be extracted from table B which then require even more rows from table A ... and so on.

    It is possible to have multiple tables involved in the feedback cycle. Extraction set explosions generated through the interactions of multiple tables can be quite difficult to diagnose.

  3. A rule is built referencing low cardinality data. Low cardinality data is data that, although it may have many occurrences (appears in many rows), does not have much variation. An example of this would be a GENDER column. Typically this would only have one of two values "M" or "F". There may be millions of rows but the value of each row will be one of the two options.

    So, as a simplistic example, lets assume a subset of an INVOICE table was required. Each line in the invoice table has a TAX_ID column which indicates the sales taxes applicable to that item. Now, there are not many TAX_ID's and the permitted values are referenced in a table called TAX_CODES. To make sure that nobody can enter in unauthorized tax codes there is a foreign key relationship between the TAX_CODES table and the TAX_ID column in the invoice table. DataBee treats this foreign key as a rule so for each row extracted in the INVOICE table it will make sure the supporting row from the TAX_CODE table is extracted. Mostly these rows will be duplicates (which DataBee sifts out) since there are many rows extracted in the INVOICE table and they all use the same TAX_ID values. Now, if a poorly designed rule was manually placed on the TAX_CODE table which required that for every extracted TAX_ID row, the rows in the INVOICE table with that TAX_ID were required then an extraction set explosion would occur. The TAX_CODE table has few values and the TAX_ID field in INVOICE table is low cardinality. For example if a record with a TAX_ID of 15% was extracted, then the matching record in the TAX_CODE table would have to be extracted. So far everything is ok. However, the second poorly designed rule joining the low cardinality columns will activate. It will then require that all rows with a TAX_ID of 15% in the INVOICE table must be extracted and an extraction set explosion will result.

    Note that it is rare to see a set of extraction rules derived entirely from foreign keys cause this sort of problem. It is almost always the manual addition of poorly conceived rules which cause the problem.

  4. A meaningless join. Sometimes a join which seems like a good idea is not. For example consider a situation in which there are two tables: CUSTOMER and PRODUCT. Both tables have a column entitled GROUP_ID which contains a four digit number. You might be tempted to set up a rule which requires that for every extracted CUSTOMER extract the PRODUCT rows where CUSTOMER.GROUP_ID=PRODUCT.GROUP_ID. If the GROUP_ID means entirely different things in the two tables then an extraction set explosion may well result. For example, if a CUSTOMER group is not the same thing as a PRODUCT group then any identical four digit number is purely a coincidence. There may well be some matching six digit values but the relationship is accidental. As the rule activates it will extract far more, and completely unrelated, PRODUCT records than is required. This in turn may cause further problems as supporting rows for the extracted PRODUCT rows are extracted in other tables and an extraction set explosion can result.

  5. All Rows rules have been added to tables which have dependent rules. Many times tables need to be extracted in full, even if the rows are not strictly required to referentially support the other extracted rows. This is usually done in order to ensure that relevant reference data is present in the target subset database. An example of this would be the extraction of reference tables required to provide front end screens with a list of options. One might require all TAX_CODES to be present - even if some of those codes were never referenced by the extracted data.

    The problem with using All Rows rules in such situations is that in extraction sets is that it is often not clear which tables are simple reference tables which can be extracted in full without complications and which ones have potential "knock on effects" when extracted in full. For example, if a table has rules applied to it which require that for every row extracted for that table then other supporting rows from other tables must also be extracted then extracting all rows from that table with an All Rows rule will probably also extract all rows from the dependent tables. A tool called the Safe All Rows Rule Generator tool exists in the Set Designer which enables the creation of All Rows rules which are inherently "safe" - in other words the addition of All Rows rules to those tables will cause no extra rows to be extracted from other tables. It is always a much better to add All Rows rules with the Safe All Rows Rule Generator tool than to add them manually.


[Net 2000 Ltd. Home][DataBee Home][DataBee Manual][DataBee FAQ]