This isn’t a question for anyone to answer; I just thought I’d share the
knowledge gained from a little adventure I had recently. On that note, if anyone
would like to offer observations or ask questions (which I will do my best to
answer from the testing I did), feel welcome. Also, it’s long, so you might
want to wait for a break to read through it.
Task: Take a flat file data set and convert it into a nested data set structure.
Attributes of challenge:
i) This is a non-table based dataset. The data is from an external source, fed
to Oracle one record at a time.
ii) The data is 4 levels deep (grand-child, child, parent, grand-parent)
iii) There is no unique value that defines any level
Environment: Oracle 10g with object types created.
In case you are curious, it is the data provided by Canada Post that outlines
Canadian postal address information.
Episode 1: PLSQL Code, loops, increment variable counts, etc.
At first I tried to code that conversion directly, and I really didn’t like it
due to the level of complexity that was creeping in because of the nature of the
data. For example, how do you define which “parent” a given unit range belongs
to? How do you go about looking up the parent information for a specific child?
Episode 2: Straight SQL
I then quickly decided that particular strategy needed to change, so I went
looking for straight SQL to do the task I required, and came across the
following approach.
Data source, 3 records consisting of:

Mgr Id   Mgr Name       Department   Emp Id   Emp Name         Job Title
1        Bugs Bunny     R&D          A        Elmer Fudd       Hunter
1        Bugs Bunny     R&D          B        Elmer Fudd       Photographer
2        Yosemite Sam   Sabotage     B        Wile E. Coyote   Saboteur
Something to note: although I used column titles that are usually associated
with unique values, as you can see with Elmer, what defines “a unique record” is
“all the columns from all the levels down to that level, taken together”. This
may be difficult to grasp until you realize that addresses fit that mold: two
different cities could easily both have a 50th Street, yet the street is
commonly associated as a child of the parent city. So… the data is correct as-is
for a “flat-file, comma-delimited” type structure.
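To illustrate that point outside of Oracle, here is a small (hypothetical, non-PL/SQL) Python sketch showing why a level’s key must include every parent column, using the sample data above:

```python
# Sketch only: a level's "unique key" must include all parent columns,
# not just the id at that level.
records = [
    ("1", "Bugs Bunny", "R&D", "A", "Elmer Fudd", "Hunter"),
    ("1", "Bugs Bunny", "R&D", "B", "Elmer Fudd", "Photographer"),
    ("2", "Yosemite Sam", "Sabotage", "B", "Wile E. Coyote", "Saboteur"),
]

# Keying on emp_id alone: Elmer's "B" collides with Wile E. Coyote's "B".
by_emp_id = {r[3] for r in records}           # only 2 keys for 3 rows

# Keying on ALL columns down to the employee level: every row is unique.
by_full_key = {"~".join(r) for r in records}  # 3 distinct keys

print(len(by_emp_id), len(by_full_key))       # -> 2 3
```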
*** code start ***
SELECT mgr_id,
       mgr_name,
       department,
       CAST(MULTISET(SELECT employee_obj_typ(emp_id,
                                             emp_name,
                                             job_title)
                       FROM managed_employees i_me
                      WHERE i_me.mgr_id = o_me.mgr_id
                    ) AS employee_col_typ)
  FROM managed_employees o_me
*** code end ***

The above ends up forming nested records along the lines of:

Mgr Id   Mgr Name       Department   (Emp Id   Emp Name         Job Title)
1        Bugs Bunny     R&D          (A        Elmer Fudd       Hunter
                                      B        Elmer Fudd       Photographer)
1        Bugs Bunny     R&D          (A        Elmer Fudd       Hunter
                                      B        Elmer Fudd       Photographer)
2        Yosemite Sam   Sabotage     (B        Wile E. Coyote   Saboteur)

As you can see from the data set produced, the "children" are nested properly
with the parent. However, with a larger data set to test with, one would
quickly find a bug that is not immediately apparent. Additionally, there is a
"bug" that is immediately apparent.

Bug #1: Multiple records are returned (one for each record in the set). How do
you ensure a unique data set?

Bug #2: I only used the id as a unique identifier; one has to use the full set
of the parent's columns in order to link the parent/child relationship
properly.

I found three possible methods to eliminate the duplicates. I didn't much care
for one of them, as it required forming a more complex query utilizing a
UNION/INTERSECT type statement. The other two methods I was semi-satisfied
with:

i) You can use a grouping clause at each level to provide unique data. Again,
all columns that define that level (including parent columns) must be
identified at that level.
ii) You can use the SET function to identify unique values. It must be done at
all levels and can only be applied against a collection-type UDT. As an
example, the inner select could become
SET( CAST( MULTISET( SELECT ... ) AS employee_col_typ ) ).
Note: SET is applied against the employee_col_typ definition.

That looked promising, and it also appeared quite efficient, processing 30,000
rows in 0.10 seconds. However, instead of a table, I had a collection (of an
object type) that I needed it to work against. So I re-wrote it with an inner
reference converting the collection so I could run SQL against it:

FROM TABLE( CAST( in_address_col AS psr_address_col_typ ) )

That seemed promising... initially... against a half dozen rows. I quickly ran
into performance issues. On a simple data set of mgr_id, mgr_name, emp_id,
emp_name, job_id, job_name (all populated with numeric count values), it failed
to allocate resources when dealing with a mere 1,000 records. Additionally,
while SET and GROUP BY formed unique data from the table, both techniques
failed against the collection definition; I could never get either to return a
unique set. SET appeared to work (it compiled and ran) but the data was not
unique, and GROUP BY failed with an error indicating that the fields in the
group by clause were not part of the select columns.

Episode 3: Back to PL/SQL; redesign and use the power of collection index
datatypes as well as sparse collections.

The process is simple (in hindsight) in design:

Step I) Separate each grouping into its own collection, using the "unique
index" (all the columns up to and including that level) as the index value.
This creates a sparse collection for each set. Include a "numeric index value"
field for that level plus each parent level above it, and include a numeric
count field for each child collection expected at that level. When populating a
child record, ensure the parent's unique varchar index is populated on the
child record.
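Outside Oracle, Step I can be sketched in Python, with dicts standing in for the sparse, VARCHAR2-indexed PL/SQL collections; the field and variable names here are illustrative, not from the original code:

```python
# Sketch of Step I on the Bugs Bunny data: a single pass over the flat records
# builds one sparse "collection" per level, keyed by the concatenated unique
# index (all columns up to and including that level).
flat = [
    ("1", "Bugs Bunny", "R&D", "A", "Elmer Fudd", "Hunter"),
    ("1", "Bugs Bunny", "R&D", "B", "Elmer Fudd", "Photographer"),
    ("2", "Yosemite Sam", "Sabotage", "B", "Wile E. Coyote", "Saboteur"),
]

mgr_col = {}   # sparse: one entry per unique manager
emp_col = {}   # sparse: one entry per unique employee under a manager

for rec in flat:
    mgr_key = "~".join(rec[:3])   # mgr_id ~ mgr_name ~ department
    emp_key = "~".join(rec)       # manager columns + employee columns
    # default count / numeric index fields; they get filled in by later steps
    mgr_col.setdefault(mgr_key, {"mgr_emp_cnt": 0, "mgr_nbr_idx": 0})
    emp_col.setdefault(emp_key, {"parent_key": mgr_key,
                                 "mgr_nbr_idx": 0, "emp_nbr_idx": 0})

print(len(mgr_col), len(emp_col))   # -> 2 3
```

Note how each child record carries its parent's full varchar key, which is what makes the parent lookup in the next step a direct index hit rather than a search.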
A data example of this information is provided a bit further down. This step
performs a single loop through the base collection to set all of the
collections, default counts and unique varchar index values.

Step II) Populate the numeric index columns. The first level is simple: loop
through the sparse, unique collection and simply increment each numeric idx
value by one. Each child level needs to look up the numeric idx value of all
its parent levels, increment the counter of the parent, and set the numeric
index of the child to the parent counter. Remember, the self-collections (the
mgr collection for only mgr, etc.) are all sparsely populated at this point.
This step performs a single loop through each of the level collections to set
all numeric index values and child count values.

Step III) Form the nested object. I did this one level at a time, starting with
the master record. For example, use the count of the mgr collection to extend
the default nested collection master level; with the two managers it'd be
mgr_nested_col.extend(2). Then populate each record based on its particular
numeric index and the numeric indexes of its parents.

For example, the following information would be in the fully populated mgr
collection:

Collection index (unique_v_idx)   Mgr_emp_cnt   Mgr_nbr_idx   Other fields (mgr_id, etc.)
'1~Bugs Bunny~R&D'                2             1             not important to the structure
'2~Yosemite Sam~Sabotage'         1             2

The following information would be in the populated emp collection:

Collection index (unique_v_idx)                       Mgr_nbr_idx   Emp_nbr_idx   Other emp fields
'1~Bugs Bunny~R&D~A~Elmer Fudd~Hunter'                1             1
'1~Bugs Bunny~R&D~B~Elmer Fudd~Photographer'          1             2
'2~Yosemite Sam~Sabotage~B~Wile E. Coyote~Saboteur'   2             1

This step performs a single loop through each of the level collections to
populate the appropriate values. Using that information, I can now set (using
all variables, of course):

*** code start ***
Nested_collection(1).mgr_name := 'Bugs Bunny';
Nested_collection(1).emp_list(1).emp_name := 'Elmer Fudd';
Nested_collection(1).emp_list(2).job_name := 'Photographer';
*** code end ***

Everything I need to know to build the nested collection is now part of the
individual collections.

End result: While the final solution does not perform quite as well as the SQL
against a table set (1.3 seconds vs. 0.1 seconds to process 30,000 records), it
does perform acceptably.

What I suspect was happening with the SQL applied against a collection set is
exponential sorting. So a total of 10 records, nested 4 levels deep (dpt + mgr
+ emp + job), is 10*10*10*10 = 10,000 rows of processing for 10 real rows. Now
imagine 1,000 rows, 3 levels deep (what I tested with): 1000 * 1000 * 1000 =
1,000,000,000 = ouch! With the PL/SQL, I've converted that into being additive:
with the 1,000 rows, a max total of 9,000 rows to process rather than a
billion.

With a "flat table" of 10 records and 4 levels, the breakdown:

Step 1: 10 rows - separates the dpt, mgr, emp and job collections.
Step 2: 10 (dpt, 10 max, depending on uniqueness) + 10 (mgr) + 10 (emp) + 10
(job). Max total of 40 rows of processing for 10 records. For example, if there
is only 1 unique department, that drops down to 1 + 10 + 10 + 10 = 31.
Step 3: 10 (dpt) + 10 (mgr) + 10 (emp) + 10 (job). Max total = 40.

All steps together: 10 (for sure) + 40 (max, possibly as low as 1 (dpt) + 1
(mgr) + 1 (emp) + 10 (jobs) = 13) + 40 (max, 13 as the lowest possibility).
Final result: 90 (36 on the low end) rows of processing against 10 rows in
PL/SQL.

Definitely not the greatest design, but at the same time it could be a lot
worse. Anyway, I thought I'd share the above as I'm making use of a couple of
the latest additions of functionality from Oracle.

Roger S.
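P.S. For anyone who wants to see Steps I through III flow end to end, here is a hedged Python sketch of the same additive algorithm on the Bugs Bunny data. Dicts stand in for the sparse VARCHAR2-indexed collections, and all names are illustrative, not the original PL/SQL:

```python
# End-to-end sketch of the three steps (illustrative names, not original code).
flat = [
    ("1", "Bugs Bunny", "R&D", "A", "Elmer Fudd", "Hunter"),
    ("1", "Bugs Bunny", "R&D", "B", "Elmer Fudd", "Photographer"),
    ("2", "Yosemite Sam", "Sabotage", "B", "Wile E. Coyote", "Saboteur"),
]

# Step I: one pass -> sparse per-level collections keyed by the full unique index.
mgr_col, emp_col = {}, {}
for rec in flat:
    mgr_key, emp_key = "~".join(rec[:3]), "~".join(rec)
    mgr_col.setdefault(mgr_key, {"fields": rec[:3],
                                 "mgr_emp_cnt": 0, "mgr_nbr_idx": 0})
    emp_col.setdefault(emp_key, {"fields": rec[3:], "parent": mgr_key,
                                 "mgr_nbr_idx": 0, "emp_nbr_idx": 0})

# Step II: assign numeric indexes. The top level just counts up; each child
# looks up its parent's numeric index and bumps the parent's child counter.
for i, m in enumerate(mgr_col.values(), start=1):
    m["mgr_nbr_idx"] = i
for e in emp_col.values():
    parent = mgr_col[e["parent"]]
    parent["mgr_emp_cnt"] += 1
    e["mgr_nbr_idx"] = parent["mgr_nbr_idx"]
    e["emp_nbr_idx"] = parent["mgr_emp_cnt"]

# Step III: build the nested structure one level at a time, placing each
# record by its numeric index (and its parents' numeric indexes).
nested = [None] * len(mgr_col)                  # "extend" the master level
for m in mgr_col.values():
    nested[m["mgr_nbr_idx"] - 1] = {
        "mgr": m["fields"],
        "emp_list": [None] * m["mgr_emp_cnt"],  # extend the child level
    }
for e in emp_col.values():
    nested[e["mgr_nbr_idx"] - 1]["emp_list"][e["emp_nbr_idx"] - 1] = e["fields"]

print(nested[0]["emp_list"][1])   # -> ('B', 'Elmer Fudd', 'Photographer')
```

Each step is a single pass over a collection no bigger than the row count, which is why the work grows additively with rows and levels rather than multiplicatively, matching the additive row counts discussed above.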