Schema Analysis

As well as determining high-level information relating to the file as a whole, RPI is able to make an “educated guess” as to the fields the file contains.

Note that a file’s schema is only analyzed at initial analysis when the file is delimited.  If the file is fixed-width, its schema is analyzed at re-analysis (post-confirmation of the file’s field boundaries).

Note also that there exist a few subtle differences between the actions undertaken during delimited and fixed-width schema analysis – these are described at the end of this section.

For each field in the file, RPI can determine:

      Name: if the file has a header row, the field names it contains are used.  Any database-incompatible characters are replaced with underscores.  If any non-unique names are present, a space and integer (starting with 2, and incremented if necessary) are appended to ensure field name uniqueness.

If a field name in the header is blank, a “filler” name is used.  The filler name consists of a prefix (defined by system configuration setting DefaultFillerFieldNames) and an integer with leading zeroes (starting at 001 and incremented as required).

If no header row is present in the file, fields are named automatically.  Automatic field names consist of a prefix (defined by system configuration setting DefaultFieldNames) and an integer with leading zeroes (starting at 001 and incremented as required).

      Data type: RPI is able to determine a field’s data type using the following rules:

String: a field is determined to be a string if it contains alphanumeric data or a mixture of data.  A completely blank column is determined to be String(64).

Integer: an integer field contains only integer data.  In addition, the field name must end with one of the following:

§ CNT

§ COUNT

§ PRICE

§ AMOUNT

§ NUMBER

§ AMT

If field appears to be Integer but contains one or more leading zeroes, it is typed as String.

Decimal: all data within the fields must be either integer or decimal.  Integers with commas or currency symbols are parsed as decimal.  Leading zeroes are disregarded.  No field name checking is performed.

Date: all values must conform to one of the following date formats:

§ M/d/yyyy

§ M/d/yy

§ dddd, MMMM d, yyyy

§ d.M.yyyy

§ d-M-yyyy

§ d/M/yyyy

§ d.M.yy

§ d-M-yy

§ d/M/yy

§ d MMM yyyy

§ d MMMM yyyy

§ d MMMM yy

§ d MMM yy

§ d-MMM-yy

§ yyyy-M-d

If two competing formats are found, the field defaults to String.

Time: values must conform to one of the following time formats:

§ H:m:s.fff

§ H:m:s

§ h:m:s.ffftt

§ h:m:stt

§ h:m:s.fff tt

§ h:m:s tt

§ H:m

§ h:mtt

§ h:m tt

If two competing formats are found, the field defaults to String.

DateTime: all data conforms to a combination of one of the supported Date and one of the Time formats, separated by a space.  If two competing formats are found, the field defaults to String.

      Size: only relevant for data types String, Integer and Decimal.

String: size is set in accordance with the maximum observed string length within the file.  If the size exceeds 4000 characters, data is truncated.  If greater than 4, size is rounded up to the nearest power of 2 (i.e. 8, 16, 32, 64, 128, 256, 512, 1024, 2048 up to 4000).

Integer: size is set in accordance with maximum number of digits observed.

Decimal: size is set to the field’s precision – the maximum number of pre-decimal digits plus the maximum number of post-decimal digits.

      Scale: only relevant for Decimal data.  Scale is set to the maximum number of post-decimal digits.

Schema analysis performed in respect of fixed-width files differs from delimited files as follows:

      Fields cannot be analyzed as integer due to the absence of a header record.

      A numerical field is determined to be a decimal if it contains at least one decimal value (i.e. one value with at least one digit to the right of the mantissa).

      An empty string field’s size is determined by the actual field width, rather than being set to 64.

      A string field’s size is set to the actual field width, as opposed to using the “power of two” rule.