Toad World® Forums

Subject: Taking a flat data set and converting it to a nested structure - Warning, Long read


This isn’t a question for anyone to answer. I thought I’d share the
knowledge gained from a little adventure I had recently. On that note, if anyone
would like to offer observations or ask questions (which I will do my best to
answer from the testing I observed), feel welcome. Additionally, it’s long so
you might want to wait for a break to read through it.

Task: Take a flat file data set and convert it into a nested data set structure.

Attributes of challenge:

i) This is a non-table-based dataset. The data comes from an external source, fed
to Oracle one record at a time.

ii) The data is 4 levels deep (grand-child, child, parent, grand-parent)

iii) There is no unique value that defines any level

Environment: Oracle 10g with object types created.

In case you are curious, it is the data provided by Canada Post that outline
“valid addresses”.

Episode 1: PLSQL Code, loops, increment variable counts, etc.

At first I tried to code that exchange in PL/SQL and really didn’t like the
level of complexity that was creeping in due to the nature of the data. For
example, how do you define which “parent” a given unit range belongs
to? How do you go about looking up the parent information for a specific child?

Episode 2: Straight SQL

I then quickly decided that particular strategy needed to change, so I went
looking for SQL to do the task I required. I came across the following.

Data Source, 3 records consisting of:

Mgr Id~~ Mgr Name~~~~~~ Department~~ Emp Id~~ Emp Name~~~~~~~~ Job Title

1~~~~~~~ Bugs Bunny~~~~ R&D~~~~~~~~~ A~~~~~~~ Elmer Fudd~~~~~~ Hunter

1~~~~~~~ Bugs Bunny~~~~ R&D~~~~~~~~~ B~~~~~~~ Elmer Fudd~~~~~~ Photographer

2~~~~~~~ Yosemite Sam~~ Sabotage~~~~ B~~~~~~~ Wile E. Coyote~~ Saboteur

Something to note: although I used column titles that are usually associated
with unique values, as you can see with Elmer, what defines “a unique record” is
“all the columns from all the levels down to that level, taken together”. This may
be difficult to understand until you realize that addresses fit that mold. Two
different cities could easily both have a 50th Street, yet the street is commonly
associated as a child of the parent city. So… the data is correct as-is
from a “flat-file, comma-delimited” type structure.

*** code start ***

SELECT manager_obj_typ( o_me.mgr_id,
                        o_me.mgr_name,
                        o_me.department,
                        CAST( MULTISET( SELECT employee_obj_typ( i_me.emp_id,
                                                                 i_me.emp_name,
                                                                 i_me.job_title )
                                          FROM managed_employees i_me
                                         WHERE i_me.mgr_id = o_me.mgr_id
                                      ) AS employee_col_typ )
                      )
  FROM managed_employees o_me

*** code end ***

The above ends up forming nested records along the lines of:

Mgr Id~~ Mgr Name~~~~~ Department~~ (Emp Id~~ Emp Name~~~~~~~ Job Title)

1~~~~~~~~Bugs Bunny~~~ R&D~~~~~~~~~ (A~~~~~~~ Elmer Fudd~~~~~ Hunter

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ B~~~~~~~ Elmer Fudd~~~~~ Photographer)

1~~~~~~~~Bugs Bunny~~~ R&D~~~~~~~~~ (A~~~~~~~ Elmer Fudd~~~~~ Hunter

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ B~~~~~~~ Elmer Fudd~~~~~ Photographer)

2~~~~~~~~Yosemite Sam~ Sabotage~~~~ (B~~~~~~~ Wile E. Coyote~ Saboteur)

As you can see from the dataset produced, the "children" are nested properly
within the parent. However, with a larger dataset to test with, one would quickly
find a bug that is not immediately apparent. Additionally, there is a "bug" that
is immediately apparent.

Bug #1: Multiple records are returned (one for each row in the set); how do you
ensure a unique data set?

Bug #2: I only used the id as a unique identifier; one has to use the full set
of the parent's columns in order to link the parent/child relationship properly.

I found three possible methods to eliminate the duplicates. I didn't much care
for one of them, as it required forming a more complex query utilizing a
UNION/INTERSECT-type statement.

Two of the possible methods I was semi-satisfied with.

i) You can use a grouping clause at each level to provide unique data. Again,
all columns that define that level (including parent columns) must be identified
at that level.

ii) You can use the SET function to identify unique values; it must be done at
all levels and can only be applied against a collection-type UDT. As an example,
the inner select could become SET( CAST( MULTISET( SELECT ... ) AS
employee_col_typ ) ). Note: SET is applied against the employee_col_typ.
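To make the uniqueness rule concrete outside SQL, here is a minimal Python sketch (data and names are illustrative only, not the actual Canada Post feed) of the dedup result the SET / GROUP BY techniques are after:

```python
# Illustrative stand-in for the dedup goal of SET()/GROUP BY: a record is
# unique only by the full tuple of columns down to its level, never by a
# single id column.
flat_rows = [
    ("1", "Bugs Bunny", "R&D", "A", "Elmer Fudd", "Hunter"),
    ("1", "Bugs Bunny", "R&D", "B", "Elmer Fudd", "Photographer"),
    ("2", "Yosemite Sam", "Sabotage", "B", "Wile E. Coyote", "Saboteur"),
]

# Parent level: unique by all three parent columns taken together.
unique_mgrs = {row[:3] for row in flat_rows}

# Child level: unique by ALL columns down to the child level; emp_id alone
# would wrongly merge the two different "B" employees.
unique_emps = set(flat_rows)
ids_only = {row[3] for row in flat_rows}

print(len(unique_mgrs), len(unique_emps), len(ids_only))  # 2 3 2
```

Note how deduplicating on `emp_id` alone collapses two genuinely different employees, which is exactly the 50th Street problem described above.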

That looked promising. It also appeared quite efficient, processing 30,000 rows
in 0.10 seconds. However, instead of a table, I had a collection (of object
type) I needed it to work against. So I re-wrote it with the inner reference
converting the collection so I could run SQL against it: FROM TABLE( CAST(
in_address_col AS psr_address_col_typ ) )

That seemed promising.... initially.... against a half dozen rows.

I quickly ran into performance issues. On a simple dataset of a mgr_id,
mgr_name, emp_id, emp_name, job_id, job_name (all populated with numeric count
values), it quickly failed to allocate resources when dealing with a mere 1,000
rows.

Additionally, while SET and GROUP BY formed unique data from the table, both
techniques failed with the collection definition: I could never get either to
return a unique set. SET appeared to work (it compiled and ran) but the data was
not unique; GROUP BY failed, giving an error indicating that the fields in the
group by clause were not part of the select columns.

Episode 3: Back to PLSQL, redesign and use the power of collection index
datatypes as well as sparse collections.

The design of the process is simple (in hindsight):

Step I) Separate each grouping into its own collection. Use the "unique index"
(all the columns up to and including that level) for the index value. This
creates a sparse collection for each set. Include a "numeric index value" field
for that level plus each parent level above it. Include a numeric count field
for each child collection expected for that level. When populating the child
record, ensure the parent unique varchar index is populated on the child record.

A data example of this information is provided a bit further down.

This performs a single loop through the base collection to set all of the
collections, default counts, and unique varchar index values.
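Step I can be sketched in Python, with dicts standing in for the PL/SQL associative arrays indexed by VARCHAR2 (all names and data here are illustrative, not the production code):

```python
# One pass over the flat rows builds a sparse collection per level, keyed
# by the concatenation of every column that defines that level.
flat_rows = [
    ("1", "Bugs Bunny", "R&D", "A", "Elmer Fudd", "Hunter"),
    ("1", "Bugs Bunny", "R&D", "B", "Elmer Fudd", "Photographer"),
    ("2", "Yosemite Sam", "Sabotage", "B", "Wile E. Coyote", "Saboteur"),
]

mgrs = {}  # keyed by 'mgr_id~mgr_name~department'
emps = {}  # keyed by all six columns joined together

for mgr_id, mgr_name, dept, emp_id, emp_name, job in flat_rows:
    mgr_key = "~".join((mgr_id, mgr_name, dept))
    emp_key = "~".join((mgr_id, mgr_name, dept, emp_id, emp_name, job))
    # Duplicate keys simply overwrite, so each unique combination is stored
    # exactly once -- the "sparse collection" effect.
    mgrs[mgr_key] = {"mgr_id": mgr_id, "mgr_name": mgr_name,
                     "department": dept, "mgr_emp_cnt": 0, "mgr_nbr_idx": 0}
    # Each child record carries its parent's unique varchar index.
    emps[emp_key] = {"emp_id": emp_id, "emp_name": emp_name, "job_title": job,
                     "parent_key": mgr_key, "mgr_nbr_idx": 0, "emp_nbr_idx": 0}

print(len(mgrs), len(emps))  # 2 unique managers, 3 unique employee rows
```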

Step II) Populate the numeric index columns. The first level is simple: loop
through the sparse, unique collection and simply increment each numeric idx
value by one. Each child level needs to look up the numeric idx values of all
its parent levels, increment the child counter of the parent, and set the
numeric index of the child to that parent counter. Remember, the
self-collections (mgr collection for only mgr, etc.) are all sparsely populated
at this point.

This performs a single loop through each of the level collections to set all
numeric index values and child count values.
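Continuing the same sketch, Step II in illustrative Python (dicts again stand in for the PL/SQL associative arrays; this relies on dicts preserving insertion order, Python 3.7+):

```python
# Sparse collections as they exist after Step I (counts and indexes still 0).
mgrs = {
    "1~Bugs Bunny~R&D": {"mgr_emp_cnt": 0, "mgr_nbr_idx": 0},
    "2~Yosemite Sam~Sabotage": {"mgr_emp_cnt": 0, "mgr_nbr_idx": 0},
}
emps = {
    "1~Bugs Bunny~R&D~A~Elmer Fudd~Hunter":
        {"parent_key": "1~Bugs Bunny~R&D", "mgr_nbr_idx": 0, "emp_nbr_idx": 0},
    "1~Bugs Bunny~R&D~B~Elmer Fudd~Photographer":
        {"parent_key": "1~Bugs Bunny~R&D", "mgr_nbr_idx": 0, "emp_nbr_idx": 0},
    "2~Yosemite Sam~Sabotage~B~Wile E. Coyote~Saboteur":
        {"parent_key": "2~Yosemite Sam~Sabotage", "mgr_nbr_idx": 0, "emp_nbr_idx": 0},
}

# First level: one pass over the sparse manager collection, numbering 1..n.
for i, mgr in enumerate(mgrs.values(), start=1):
    mgr["mgr_nbr_idx"] = i

# Child level: look the parent up by its unique varchar key, bump the
# parent's child counter, and take that counter as the child's own index.
for emp in emps.values():
    parent = mgrs[emp["parent_key"]]
    parent["mgr_emp_cnt"] += 1
    emp["mgr_nbr_idx"] = parent["mgr_nbr_idx"]
    emp["emp_nbr_idx"] = parent["mgr_emp_cnt"]
```

After these two passes, Bugs Bunny carries (idx 1, 2 emps), Photographer carries (mgr 1, emp 2), and the Saboteur carries (mgr 2, emp 1), matching the data example a bit further down.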

Step III) Form the nested object.

I did this one level at a time, starting with the master record. For example, use
the count of the mgr collection to extend the default nested collection at the
master level. With the two managers it'd be mgr_nested_col.extend(2). Then
populate each record based on its particular numeric index and the numeric
indexes of its parent levels.

For example:

The following information would be in the fully populated mgr collection, (using
^'s to represent white space):

Collection index^^^^ fields in collection

Unique_v_idx:^^^^^^^^^^^^^^ Mgr_emp_cnt^^ Mgr_nbr_idx^^ other_fields_like_mgr_id

'1~Bugs Bunny~R&D'^^^^^^^^^ 2^^^^^^^^^^^^ 1^^^^^^^^^^^^ not important for the structure

'2~Yosemite Sam~Sabotage'^^ 1^^^^^^^^^^^^ 2

The following information would be in the populated emp collection:

Unique_v_idx:^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Mgr_nbr_idx^^ Emp_nbr_idx^^ other_emp_fields

'1~Bugs Bunny~R&D~A~Elmer Fudd~Hunter'^^^^^^^^^^^^^^^ 1^^^^^^^^^^^^ 1

'1~Bugs Bunny~R&D~B~Elmer Fudd~Photographer'^^^^^^^^^ 1^^^^^^^^^^^^ 2

'2~Yosemite Sam~Sabotage~B~Wile E. Coyote~Saboteur'^^ 2^^^^^^^^^^^^ 1

This performs a single loop through each of the level collections to populate
the appropriate values.

Using that information, I can now set (using all variables of course):

Nested_collection(1).mgr_name := 'Bugs Bunny';

Nested_collection(1).emp_list(1).emp_name := 'Elmer Fudd';

Nested_collection(1).emp_list(2).job_name := 'Photographer';

Everything I need to know to build the nested collection is now part of the
individual collections.
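Putting Step III in the same illustrative Python terms (lists stand in for the extended nested collections; the data matches the fully populated collections shown above):

```python
# Fully populated level collections, as they stand after Step II.
mgrs = {
    "1~Bugs Bunny~R&D":
        {"mgr_name": "Bugs Bunny", "mgr_nbr_idx": 1, "mgr_emp_cnt": 2},
    "2~Yosemite Sam~Sabotage":
        {"mgr_name": "Yosemite Sam", "mgr_nbr_idx": 2, "mgr_emp_cnt": 1},
}
emps = {
    "1~Bugs Bunny~R&D~A~Elmer Fudd~Hunter":
        {"emp_name": "Elmer Fudd", "job_title": "Hunter",
         "mgr_nbr_idx": 1, "emp_nbr_idx": 1},
    "1~Bugs Bunny~R&D~B~Elmer Fudd~Photographer":
        {"emp_name": "Elmer Fudd", "job_title": "Photographer",
         "mgr_nbr_idx": 1, "emp_nbr_idx": 2},
    "2~Yosemite Sam~Sabotage~B~Wile E. Coyote~Saboteur":
        {"emp_name": "Wile E. Coyote", "job_title": "Saboteur",
         "mgr_nbr_idx": 2, "emp_nbr_idx": 1},
}

# "Extend" the master collection once to the number of unique managers.
nested = [None] * len(mgrs)
for mgr in mgrs.values():
    nested[mgr["mgr_nbr_idx"] - 1] = {
        "mgr_name": mgr["mgr_name"],
        # Extend the child collection to this manager's employee count.
        "emp_list": [None] * mgr["mgr_emp_cnt"],
    }

# Direct assignment: every record already knows exactly where it belongs.
for emp in emps.values():
    nested[emp["mgr_nbr_idx"] - 1]["emp_list"][emp["emp_nbr_idx"] - 1] = {
        "emp_name": emp["emp_name"], "job_title": emp["job_title"],
    }

print(nested[0]["emp_list"][1]["job_title"])  # Photographer
```

Because every record already carries its target indexes, each assignment is a direct O(1) placement; no searching through the other collections is needed.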

End result:

While the final solution does not perform quite as well as the SQL against a
table set (1.3 seconds vs .1 seconds to process 30,000 records), it does perform
acceptably.

What I suspect was happening with the SQL applied against a collection set is
multiplicative (Cartesian-product style) sorting. So a total of 10 records,
nested 4 levels (dep + mgr + emp + job), is 10*10*10*10 = 10,000 rows of
processing for 10 real rows. Now imagine 1,000 rows, 3 levels deep (what I
tested with): 1000 * 1000 * 1000 = 1,000,000,000 = ouch!

With the PLSQL, I've converted that into being additive. With the 1,000 rows, a
max total of 9,000 rows to process rather than a billion.

With a "flat-table" of 10 records and 4 levels, the breakdown:

Step 1:

10 rows - separates dpt, mgr, emp and job collections

Step 2:

10 (dpt, 10 max, depending on uniqueness)

+ 10 (mgr)

+ 10 (emp)

+ 10 (job)

Max total of 40 rows of processing for 10 records.

For example, if 1 unique department, that drops it down to 1 + 10 + 10
+ 10 = 31.

Step 3:

10 (dpt)

+ 10 (mgr)

+ 10 (emp)

+ 10 (job)

Max total = 40

All steps together:

10 (for sure)

+ 40 (max, possibly as low as 1 (dpt) + 1 (mgr) + 1 (emp) + 10
(jobs) = 13

+ 40 (max, 13 as lowest possibility)

Final result = 90 (36 on the low end) rows of processing against 10 rows in
PLSQL. Definitely not the greatest design but at the same time it could be a lot
worse.
Anyway, I thought I'd share the above as I'm making use of a couple of the
latest additions of functionality from Oracle.

Roger S.