CIS4365: Database Applications
Fall, 2017

WHY IS FILE ORGANIZATION IMPORTANT?

 File Organization

File organization is a technique for physically arranging the records of a file on secondary storage. Given that a file consists of a collection of records, a key element in file management is the way in which the records themselves are organized inside the file, since this heavily affects system performances as far as record finding and access.

Choosing a file organization is a design decision, hence it must be done having in mind the achievement of good performance with respect to the most likely usage of the file. In choosing a file organization for a particular file in a database, you should consider seven important factors:

bulletFast data retrieval
bulletHigh throughput for processing data input and maintenance transactions
bulletEfficient use of storage space
bulletProtection from failures or data loss
bulletMinimizing need for reorganization
bulletAccommodating growth
bulletSecurity from unauthorized use

In file organization there are several ways of organizing records in files, which are as follows:

bulletsequential file organization
bulletindexed file organization
bullethashing file organization

Sequential File Organization

Sequential file organization is the storage of records in a file in sequence according to a primary key value. It is the most common structure for large files that are typically processed in their entirety, and it's at the heart of the more complex schemes. In this scheme, all the records have the same size and the same field format, with the fields having fixed size as well. The records are sorted in the file according to the content of a field of a scalar type, called ``key''. The key must identify uniquely a records, hence different record have different keys. This organization is well suited for batch processing of the entire file, without adding or deleting items: this kind of operation can take advantage of the fixed size of records and file; moreover, this organization is easily stored both on disk and tape. The key ordering, along with the fixed record size, makes this organization open to to dicotomic search. However, adding and deleting records to this kind of file is a tricky process: the logical sequence of records typically matches their physical layout on the media storage, so to ease file navigation, hence adding a record and maintaining the key order requires a reorganization of the whole file. The usual solution is to make use of a ``log file'' (also called ``transaction file''), structured as a pile, to perform this kind of modification, and periodically perform a batch update on the master file.

A sequential file is designed for efficient processing of records in sorted order on some search key.

bulletRecords are chained together by pointers to permit fast retrieval in search key order.
bulletPointer points to next record in order.
bulletRecords are stored physically in search key order (or as close to this as possible).
bulletThis minimizes number of block accesses.
bulletFigure 10.15 shows an example, with bname as the search key.

It is difficult to maintain physical sequential order as records are inserted and deleted.

bulletDeletion can be managed with the pointer chains.
bulletInsertion poses problems if no space where new record should go.
bulletIf space, use it, else put new record in an overflow block.
bulletAdjust pointers accordingly.
bulletFigure 10.16 shows the previous example after an insertion.
bulletProblem: we now have some records out of physical sequential order.
bulletIf very few records in overflow blocks, this will work well.
bulletIf order is lost, reorganize the file.
bulletReorganizations are expensive and done when system load is low.

If insertions rarely occur, we could keep the file in physically sorted order and reorganize when insertion occurs. In this case, the pointer fields are no longer required.

Aces

Boilermakers

Devils

Flyers

Hawkeyes

Hoosiers



Miners

Panthers



Seminoles

..

Indexed File Organization

Indexed file organization is the storage of records either sequentially or non-sequentially with an index that allows software to locate individual records. An index is a table or other data structure used to determine the location of rows in a file that satisfy some condition. Each index entry matches a key value with one or more records. An index can point to unique records (a primary key index) or potentially more than one record. A secondary key is one field or a combination of fields for which more than one record may have the same combination of values, which is also called a non-unique key.

When the terms primary and secondary index are used, there are four type of indexes:

bulletUnique primary index (UPI) is an index on a unique field, possibly the primary key of the table, which not only is used   to find table rows based on this field value but also is used by the DBMS to determine where to store a row based on the primary index field value.
bulletNon-unique primary index (NUPI) is an index on a non-unique field, which not only is used to find table rows based on this field value but also is used by the DBMS to determine where to store a row based on the primary index field value.
bulletUnique secondary index (USI) is an index on a unique field, which is used only to find table rows based on this field value.
bulletNon-unique secondary index (NUSI) is an index on a non-unique field, which is used only to find table rows based on this field value.

One of the most powerful capabilities of indexed file organizations is the ability to create multiple indexes.

Hashed File Organization

Hashed file organization is a storage system in which the address for each record is determined using a hashing algorithm. A hashing algorithm is a routine that converts a primary key value into a relative record number or relative file address. A typical hashing algorithm uses the technique of dividing each primary key value by a suitable prime number and then using the remainder of the division as the relative storage location. A hashing index table is a file organization that uses hashing to map a key into a location in an index, where there is a pointer to the actual data record matching the hash key. As with sequential or indexed files, a key field is required for this organization, as well as fixed record length. However, no explicit ordering in the keys is used for the hash search, other than the one implicitly determined by a hash function. A hash function is computed on some attribute of each record. The result of the function specifies in which block of the file the record should be placed.

Hashing may require overflow pages, sequential retrieval is impractical, random retrieval on primary keys is very fast since it does not need to access index, and deletion, addition, and modification of records are relatively easy. Hashing algorithm converts a primary key value into a record address.

The three families of file organizations cover most of the file organizations you will have at your disposal as you design physical files and databases.
 

References:

Hoffer, Jeffery A., Prescott, Mary B., McFadden, Fred R., Modern Database Management, Sixth Edition

http://www.cim.mcgill.ca/~franco/OpSys-304-427/lecture-notes/node53.html

http://www.cs.sfu.ca/CC/354/zaiane/material/notes/contents.html

http://www.cs.fiu.edu/~yuanj/cop4540_f00/lecture20.ppt#268,15,Hashed File Organization

 

Questions and Answers:

1. What is file organization?

        File organization is a technique for physically arranging the records of a file on secondary storage.

2. List three important factors of file organization.

        Fast data retrieval, Efficient use of storage space, and Minimizing need for reorganization.

3. What is indexed file organization?

Indexed file organization is the storage of records either sequentially or non-sequentially with an index that allows software to locate individual records.

4. List the four types of indexes used for primary and secondary index.

        Unique primary index, non-unique primary index, unique secondary index, and non-unique secondary index.

5. What is hashed file organization?

Hashed file organization is a storage system in which the address for each record is determined using a hashing algorithm.

6. What are the two hashing techniques mentioned in this tutorial?

        Hashing algorithm and hash index table.