SQL*Loader Preserving Hierarchy

13 June, 2014

Every time I go to build a SQL*Loader control file, I find out that it can do things I didn't consider and can't do things that seemed obvious that it should.  Somewhere between removing tufts of hair with fists and jumping up and down with joy, things finally get done.  I find the lack of examples and documentation to be particularly frustrating.  Or maybe I'm awful at Google.  With peer assistance and Internetting, I finally got things to do my bidding.  Anyway, I thought I'd put down some learnings in wordings to help prevent some headaches for others and myself when I forget things.

Background

So my task was to get a delimited, hierarchical file loaded into a set of tables.  I wanted to avoid just dumping everything into a table and running a complicated stored procedure over the top of it, and the file was way to big to map into memory.  And I didn't want to make a complicated process to break it into smaller chunks for tracking and time purposes.  I'll mock up some completely contrived examples for the problem and solution explanation.

Example Data

Mind that this data is made up and simpler than my actual probably, but it is structured in a way that is on purpose and will illustrate the point without being excessively complex.

There are records for each person as indicated by the first field PERSON.  Under each person are groupings of records that tie to the person:  PERSON_PHONE, and PERSON_ADDRESS.  Notice that there are more than one of these records.  In theory you would be able to have any number of phone numbers or addresses for each person, and you can have any number of people.

PERSON|123|Robert||Robertson|
PERSON_PHONE|123|1115557777|CELLULAR
PERSON_PHONE|123|1115557772|WORK
PERSON_ADDRESS|123|555 Robin Ln||St. Robert|CA|88888|HOME
PERSON_ADDRESS|123|222 Bobbypin Circle|Suite 211|St. Robert|CA|88888|WORK
PERSON|123|William|W|Williams|Jr.
PERSON_PHONE|123|1115556666|HOME
PERSON_PHONE|123|1115556622|WORK
PERSON_ADDRESS|123|212 Wiley St. E||St. Robert|CA|88888-8888|WORK
PERSON_ADDRESS|123|444 Willie St|Apt 222|St. Robert|CA|88888|HOME

Data Model

These are the made-up tables for demonstration.  Again, this is totally contrived, so please just ignore that I am being design and syntax-lazy.

CREATE TABLE PERSON (
PERSON_ID NUMBER NOT NULL PRIMARY KEY,
FILE_ID NUMBER,
FIRST_NAME VARCHAR2,
MIDDLE_NAME VARCHAR2,
LAST_NAME VARCHAR2,
);

CREATE TABLE PERSON_PHONE (
PERSON_PHONE_ID NUMBER NOT NULL PRIMARY KEY,
PERSON_ID NUMBER FOREIGN KEY REFERENCES PERSON (PERSON_ID),
FILE_ID NUMBER,
PHONE_NUMBER VARCHAR2,
PHONE_TYPE VARCHAR2
);

CREATE TABLE PERSON_ADDRESS (
PERSON_ADDRESS_ID NUMBER NOT NULL PRIMARY KEY,
PERSON_ID NUMBER FOREIGN KEY REFERENCES PERSON (PERSON_ID),
FILE_ID NUMBER,
STREET VARCHAR2,
SUITE VARCHAR2,
CITY VARCHAR2,
STATE VARCHAR2,
ZIP VARCHAR2
);

Failed First Attempt

Disclaimer:  This does not work.

The first thing I tried was to use sequences.  Bang-pow, done.  Shut down everything and go home.

Control File:

LOAD DATA
CHARACTERSET WE8MSWIN1252

APPEND

INTO TABLE PERSON
WHERE (1) = 'PERSON'
FIELDS TERMINATED BY '|' TRAILING NULLCOLS
(
RECORD_TYPE FILLER,
FILE_ID,
FIRST_NAME,
MIDDLE_NAME,
LAST_NAME,
PERSON_ID "PERSON_SEQUENCE.NEXTVAL"
)

INTO TABLE PERSON_PHONE
WHERE (1) = 'PERSON_PHONE'
FIELDS TERMINATED BY '|' TRAILING NULLCOLS
(
RECORD_TYPE FILLER,
FILE_ID,
PHONE_NUMBER,
PHONE_TYPE,
PERSON_PHONE_ID "PERSON_PHONE_SEQUENCE.NEXTVAL",
PERSON_ID "PERSON.CURRVAL"
)

INTO TABLE PERSON_ADDRESS
WHERE (1) = 'PERSON_ADDRESS'
FIELDS TERMINATED BY '|' TRAILING NULLCOLS
(
RECORD_TYPE FILLER,
FILE_ID,
STREET,
SUITE,
CITY,
STATE,
ZIP,
PERSON_ADDRESS_ID "PERSON_ADDRESS_SEQUENCE.NEXTVAL",
PERSON_ID "PERSON.CURRVAL"
)

This did not work for a number of reasons.  I'll list them off.

  1. Every time the SQL*Loader reads a record, it has some sort of positional index thing that dictates the position in the file to read.  When the first field of each record is read, the default behavior is to take a NEXT.  I don't claim to completely understand why it doesn't start at the beginning, but I suspect it's something to do with the WHEN clause eating position 1.
  2. The WHEN clause doesn't care about the terminator.  Notice how the "FIELDS TERMINATED BY" part comes after the match?  Yeah.  Because all the tables start with PERSON, all of them matched for the PERSON table.  Ooops.
  3. The SQL*Loader behavior does each "INTO TABLE" block through for the entire file before going to the next.  And it commits between.  This means that all the matches for PERSON are done first, then all the PERSON_PHONE, then PERSON_ADDRESS.  As a result, using a sequence CURRVAL is not going to work.  What happens is all the PERSON records get loaded first.  Then the sequence is sitting on the most recent insert record.  When the PERSON_PHONE records or PERSON_ADDRESS records come along, all of them get the PERSON_ID of the last inserted PERSON record.  Not cool.

Successful Attempt

To say that this was attempt 2 would be a blatant lie.  It would be more like attempt 12.  Anyway, this is where I landed to achieve level 9001.

Luckily, I had complete control over the data file spec (I put it together from an even uglier file) and over the data model (at least enough to add a column).  I could bake the hierarchy into the file and use some queries.  Since my process created the hierarchy, I could trust it.  I did try some nonsense using key values--for example, including the name in each phone and address record--but that would not guarantee a unique result, so I had to toss it out with the moldy peaches.

New Data File:

PERSON|123|1|Robert||Robertson|
PERSON_PHONE|123|1|1115557777|CELLULAR
PERSON_PHONE|123|1|1115557772|WORK
PERSON_ADDRESS|123|1|555 Robin Ln||St. Robert|CA|88888|HOME
PERSON_ADDRESS|123|1|222 Bobbypin Circle|Suite 211|St. Robert|CA|88888|WORK
PERSON|123|2|William|W|Williams|Jr.
PERSON_PHONE|123|2|1115556666|HOME
PERSON_PHONE|123|2|1115556622|WORK
PERSON_ADDRESS|123|2|212 Wiley St. E||St. Robert|CA|88888-8888|WORK
PERSON_ADDRESS|123|2|444 Willie St|Apt 222|St. Robert|CA|88888|HOME

Data Model:

CREATE TABLE PERSON (
PERSON_ID NUMBER NOT NULL PRIMARY KEY,
FILE_ID NUMBER,
SEQUENCE_ID NUMBER,
FIRST_NAME VARCHAR2,
MIDDLE_NAME VARCHAR2,
LAST_NAME VARCHAR2,
);

CREATE TABLE PERSON_PHONE (
PERSON_PHONE_ID NUMBER NOT NULL PRIMARY KEY,
PERSON_ID NUMBER FOREIGN KEY REFERENCES PERSON (PERSON_ID),
FILE_ID NUMBER,
PHONE_NUMBER VARCHAR2,
PHONE_TYPE VARCHAR2
);

CREATE TABLE PERSON_ADDRESS (
PERSON_ADDRESS_ID NUMBER NOT NULL PRIMARY KEY,
PERSON_ID NUMBER FOREIGN KEY REFERENCES PERSON (PERSON_ID),
FILE_ID NUMBER,
STREET VARCHAR2,
SUITE VARCHAR2,
CITY VARCHAR2,
STATE VARCHAR2,
ZIP VARCHAR2
);

New Control File:

LOAD DATA
CHARACTERSET WE8MSWIN1252

APPEND

INTO TABLE PERSON
WHERE (1) = 'PERSON' AND (1) != 'PERSON_'
FIELDS TERMINATED BY '|' TRAILING NULLCOLS
(
RECORD_TYPE FILLER POSITION(1),
FILE_ID,
SEQUENCE_ID,
FIRST_NAME,
MIDDLE_NAME,
LAST_NAME,
PERSON_ID "PERSON_SEQUENCE.NEXTVAL"
)

INTO TABLE PERSON_PHONE
WHERE (1) = 'PERSON_PHONE'
FIELDS TERMINATED BY '|' TRAILING NULLCOLS
(
RECORD_TYPE FILLER POSITION(1),
SEQUENCE_ID BOUNDFILLER,
FILE_ID,
PHONE_NUMBER,
PHONE_TYPE,
PERSON_PHONE_ID "PERSON_PHONE_SEQUENCE.NEXTVAL",
PERSON_ID "(SELECT PERSON_ID FROM PERSON WHERE FILE_ID = :FILE_ID AND SEQUENCE_ID = :SEQUENCE_ID)"
)

INTO TABLE PERSON_ADDRESS
WHERE (1) = 'PERSON_ADDRESS'
FIELDS TERMINATED BY '|' TRAILING NULLCOLS
(
RECORD_TYPE FILLER POSITION(1),
SEQUENCE_ID BOUNDFILLER,
FILE_ID,
STREET,
  SUITE,
  CITY,
  STATE,
  ZIP,
  PERSON_ADDRESS_ID "PERSON_ADDRESS_SEQUENCE.NEXTVAL",
  PERSON_ID "(SELECT PERSON_ID FROM PERSON WHERE FILE_ID = :FILE_ID AND SEQUENCE_ID = :SEQUENCE_ID)"
)

 Notice a few things.

  1. The first field in each table has a POSITION(1) flag on it to fix the problem of grabbing a NEXT instead of the first column.
  2. The PERSON table now has an AND in the WHEN clause for not matching to other tables.
  3. Now there is a SQL query using the sequence number for referencing the right row.
    1. The SEQUENCE_ID in the address and phone tables don't have a column, so we had to flag it is a FILLER, but not just any filler; a BOUNDFILLER.  This allows the column to be available in the query for PERSON_ID.  Otherwise no dice.
    2. The select query is in parenthesis.  If you don't do that, you can't use a select query.  This is because you are only supposed to access functions in the control file.  Apparently putting a select query in parenthesis makes it a function.

So there it is.  Peace at last.

 

New comment