CIS5394: Decision Making and Expert Systems
(Current Issues in CIS)
Spring 2004

Sample Paper

 ******************

NOTE: This is a research paper I wrote some time ago, which was eventually published in Management Science. It is included here not because it is a great example of what a research paper should be (although I am personally very happy with it) but because:

  1. It follows all the rules I have laid down for you
  2. Getting copies of other published works requires copyright permissions. It’s a big hassle

******************

Accessibility, Security and Accuracy in Statistical Databases:

The Case For the Multiplicative Fixed Data Perturbation Approach

ABSTRACT

Organizations store data regarding their operations, employees, consumers, and suppliers in their databases.  Some of the data contained in such databases are considered confidential, and by law, the organization is required to provide appropriate security measures in order to preserve confidentiality.  Yet, a large number of companies have little or no security measures.  This lack of security can be attributed to the fact that very little is known about the relative effectiveness of security mechanisms.  This study investigates the effectiveness of different security mechanisms that can be employed for protecting numerical, confidential attributes in a database.  The trade-off between security, accessibility, and accuracy is examined.  A comparison of different security mechanisms reveals that data perturbation is the preferred security mechanism since it minimizes security lapses while maximizing accessibility.  An investigation of different approaches for data perturbation indicates that multiplicative fixed data perturbation provides a higher level of accuracy (by maintaining the characteristics of the original dataset) without sacrificing security.  Based on its ability to provide high levels of security, accessibility, and accuracy, multiplicative fixed data perturbation is the approach recommended to protect confidential numerical attributes residing in organizational databases.

 INTRODUCTION

      With the proliferation and expanded utilization of corporate databases, the incidence of computer abuse has increased correspondingly (Parker, 1981; Wong, 1985; Straub, 1990).  In response, legislation prohibiting the unauthorized release of sensitive personal information has been enacted (U.S. Department of HEW Advisory Committee on Automated Personal Data Systems, 1973), and organizations are being held accountable for breaches in data security.  There are over 15 federal legislative acts, and a greater number of state laws, protecting individuals from unwarranted release of information (Laudon, 1986).  Such legislation, however, is not intended to deny access to databases for reporting, decision making, and research purposes.  For example, the U.S. Freedom of Information Act requires some government agencies and private organizations to make certain information available on request, as do a number of state sunshine laws.

     Presently, an acute conflict exists between the individual's right to privacy and society's need to know and process information (Palley, 1986; Palley and Simonoff, 1987).  The dilemma characterizes three of the concerns which Mason (1986) has termed among the most important ethical issues facing today's Information Systems (IS) Manager: Privacy, Accuracy, and Accessibility.  The organization is charged with the responsibility of ensuring that accurate data is securely maintained but accessible to authorized individuals within the organization and, in some cases, outside the organization (Wysocki and Young, 1990; Laudon and Laudon, 1991).  Failure to protect sensitive data, aside from the legal issues involved, can lower employee and customer confidence in the organization's practices, while providing inadequate access to accurate information can hinder organizational functioning and growth.

     Surveys of IS managers validate the importance of security and control (Dickson et al., 1984; Hartlog and Herbert, 1986; Brancheau and Wetherbe, 1987), but also indicate a reluctance to institute binding security measures (Straub, 1990).  As of 1986, only 60% of all organizations had implemented IS security as a functional area (Hoffer and Straub, 1989).  A recent study (Datamation, 1993) reported that 33% of the companies surveyed had little or no security measures.  Even those that did typically devoted less than nine hours a week to the function (Straub and Hoffer, 1987) and generally assigned responsibility for such measures to lower level managers (Straub, 1988).  Straub speculates that the most convincing rationale for this situation is that IS managers have misinformed opinions about the efficacy and net worth of investment in security mechanisms (Goodhue and Straub, 1988).  Contributing to these impressions are the lack of empirical findings on the relative effectiveness of security mechanisms and the paucity of established guidelines for implementing them.

     The IS manager in charge of database security needs to consider a series of trade-offs involving security, accessibility, and accuracy.  The security/accessibility trade-off is often addressed through a variety of security mechanisms.  In order to provide adequate security, some of these mechanisms require altering the database.  Such alterations change the characteristics of the dataset and hence reduce the accuracy.  However, as Denning et al. (1979) have noted, "the requirement of complete secrecy of confidential information is not consistent with the requirement of producing exact statistical measures for arbitrary subsets of the population.  At least one of these requirements must be relaxed ..." (p. 92).  It is the responsibility of the IS manager to select the appropriate security mechanism which will maximize security, minimize bias, and provide high accessibility.

     The objective of this study is to enable the IS manager to select the appropriate procedure for database security.  In order to achieve this objective, we first compare the relative effectiveness of different security mechanisms and recommend the mechanism which best addresses security and accessibility issues.  Next, for the preferred security mechanism, we evaluate the relative effectiveness of the different techniques available and recommend the technique that provides higher accuracy without sacrificing security.

     The paper is organized as follows:  We first consider some of the salient characteristics of organizational databases and the security measures available to protect them.  Next, we focus our attention on one class of security measures, the fixed data perturbation approach, and the techniques applied to create the surrogate database.  We then describe the methodology used to examine the efficacy of the techniques, and discuss the findings in detail.  The final section presents the conclusions of the study.

DATABASE SECURITY MECHANISMS

     Organizational databases can be defined, in general terms, as collections of datasets about the organization, including transactions, customers, suppliers, employees, and other data intended to support operations, as well as the relationships between them.  While most datasets tend to be numerical, some categorical datasets may also exist.  This study examines only those datasets that contain numerical information.

     Empirical evidence suggests that numerical datasets relating to organizations generally follow a Log-normal distribution.  The most convincing finding was provided by Neter and Loebbecke (1975), who, based on an extensive review of accounting data, concluded that the Log-normal distribution is best suited to describe accounting data.  Similar assertions have been made for quite some time (Simon and Bonini, 1958;  Steindl, 1965;  Quandt, 1966;  Brown, 1967;  Charnes, et al., 1968;  Thatcher, 1968; O'Neill and Wells, 1972; Herron, 1974) and continue to find favor in more recent studies (Aitchison and Brown, 1976;  Jain, 1977;  Easton, 1980;  Lawrence, 1980; Crow and Shimizu, 1988).

     The Log-normal distribution offers a rich variety of forms, from approximately normal to heavily skewed[1], and is usually characterized using three parameters (shape, scale, and shift).  The Log-normal distribution allows for the representation of the two types of numerical data encountered in business databases: datasets where both positive and negative values may occur (such as account balances, changes in revenue, and inventory levels with back-orders), and datasets where only positive values are allowed (such as salary, employee ages, number of employees, units produced, and sales).  Based on its ability to describe business data as reported in the literature, the Log-normal distribution is used in this study to describe organizational datasets.

     Any mechanism selected by an IS manager to provide database security must not alter the basic characteristics of the database or of the individual attribute values.  If an attribute's values have a Log-normal distribution with specified characteristics, the security mechanism used to guard against disclosures should yield outcomes that represent the attribute's characteristics.

     IS managers can use Statistical Database Systems (SDBS) in order to secure databases.  An SDBS is a database system that provides users only with aggregate statistics of confidential attributes and prevents them from gathering information on specific individuals (Adam and Jones, 1989).  Nonetheless, SDBS, especially those which rely on the actual datasets for analysis, are not necessarily inviolate.  They may be compromised by partial or exact disclosure (Beck, 1980; Adam and Wortman, 1989), either unintentionally or as a result of the skills of a determined 'snooper'.  Consider the following examples:

1.   Assume that two datasets containing age and salary information are part of an SDBS.  Exact disclosure may occur if the scope of a search is narrowed to the point where there is only one individual satisfying a set of select conditions.

2.   Assume the date an employee joins a particular department within an organization is known, and that employee is the only individual hired on that day.  If the average salary of the department immediately prior to and after the hiring is provided by the SDBS, the exact salary of the new employee can be easily determined.
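     To make the second example concrete, the arithmetic a snooper would apply amounts to differencing two payroll totals.  The figures below are hypothetical, and the Python fragment is only a sketch of the inference, not part of any SDBS implementation.

    # Hypothetical figures: a department of 9 employees averaging $50,000
    # hires one person, after which the SDBS reports an average of $51,000
    # for 10 employees.
    n_before, avg_before = 9, 50_000.0
    n_after, avg_after = 10, 51_000.0

    # The two payroll totals differ by exactly the new employee's salary,
    # so an exact value is disclosed from two apparently harmless queries.
    new_salary = n_after * avg_after - n_before * avg_before
    print(new_salary)   # 60000.0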

     While such examples may seem contrived, they are in fact quite representative of the difficulties of maintaining confidentiality in SDBS.  Miller (1971) describes an event where a database of 188 physicians in Illinois was created and an SDBS approach was employed so that an individual physician could not be identified.  Yet, exact disclosure was achieved by sub-dividing the database into special categories.  Adam and Jones (1989) also provide examples of queries which allow for exact disclosure.  Hence, even if the IS manager decides to use SDBS for security purposes, it is necessary to implement additional protection mechanisms to reduce the possibility of partial or exact disclosure.

     Four general classes of security control mechanisms are often used to provide this added protection, namely, Conceptual Models, Query Restriction, Output Perturbation, and Data Perturbation.  Adam and Wortman (1989) provide a comprehensive survey of these methods.  Based on their survey, they conclude that no single security mechanism completely satisfies all the requirements of database security and suggest that more research is required in this area.

Conceptual Model

     In the conceptual model (Chin and Ozsoyoglu, 1981; Ozsoyoglu and Chin, 1982), only the collection of datasets with common attributes and their statistics are made available to the user.  No data manipulation language, such as one based on relational algebra, is allowed to merge and intersect populations.  While this method is attractive in terms of security, it severely limits access and does not allow retrieval of related information.  Further, the conceptual model has proved extremely difficult to implement and update (Ozsoyoglu and Ozsoyoglu, 1981; Ozsoyoglu and Su, 1985) and is very expensive to develop and maintain (Adam and Wortman, 1989).

Query Restriction

     This approach involves a variety of programming restrictions placed upon users, including query-set-size controls (Fellegi, 1972), restrictions on the number of overlapping entities among successive queries (Dobkin et al., 1979), auditing of queries made by each user (Hoffman, 1977), and clustering of individual entities of the population in a number of mutually exclusive subsets (Chin and Ozsoyoglu, 1981).  Denning et al. (1979), however, have shown that sensitive information may be easily compromised even if query restriction is employed, and the approach has also been deemed combinatorially unwieldy, programmatically complicated, and expensive (Denning, 1982; Denning and Schlorer, 1983).

Output Perturbation

     Output perturbation allows queries to be made on the actual datasets, but programmatically perturbs all results prior to disclosure to the user.  A variety of techniques can be used to implement output perturbation, including analysis of a random sample of the population (Denning, 1980), the varying of values at varying rates (Beck, 1980), and rounding off results (Achugbue and Chin, 1979; Fellegi and Phillips, 1979; Haq, 1975).  The advantage of the approach is that actual values are not disclosed.  Nonetheless, a single query may yield a spurious outcome while repeated queries may allow the 'true' value to be deduced (Liew et al., 1985).  In such cases, it is also necessary to specify a minimum query-set-size restriction, which can further limit accessibility.  Additionally, Adam and Jones (1989) have pointed out that the instructions necessary to perturb the output would require complex programming and must remain on-line, placing additional demands on CPU and memory.

Data Perturbation

     In the data perturbation approach, a secondary database, hereafter referred to as the perturbed database (PDB), is constructed by altering the original values by a random factor generated from a predetermined distribution.  Considering the trade-off between security and accessibility, data perturbation provides the following advantages compared to the other security mechanisms:

(1)   It provides maximum security against exact disclosure since it guarantees that exact disclosure will never occur,

(2)   It provides maximum accessibility since users may be permitted complete access to the perturbed database,

(3)   Unlike other security mechanisms, algorithms required to perturb the data are relatively simple and are readily available in most statistical and simulation software packages, and

(4)   Unlike other security mechanisms, data perturbation software need not be kept on-line.

     A number of different data perturbation techniques have been proposed, including fixed data perturbation (Traub et al., 1984), data swapping (Reiss, 1984), multi-dimensional transformation of attributes (Schlorer, 1981), data distortion by probability distribution (Liew et al., 1985), and the substitution of data from the estimated density function (Lefons et al., 1983).  Fixed data perturbation (FDP) provides the following advantages over the other methods of data perturbation:

(1)   Unlike data swapping and multi-dimensional transformation of attributes which can be applied only to multi-categorical datasets, FDP can be applied to any numerical dataset.

(2)   Unlike data distortion by probability distribution and substitution of data from the estimated density function which do not control the level of security resulting in partial disclosure (Adam and Wortman, 1989;  Muralidhar and Batra, 1990), in the FDP method, partial disclosure can be eliminated by selecting the appropriate level of security.

The disadvantages of FDP are:

(1)   Additional disk storage is required to store the perturbed database,

(2)   Concurrency with the original dataset must be monitored (i.e. when the original value changes, the PDB must be changed accordingly), and

(3)   Data accuracy is reduced because perturbing the dataset introduces bias in statistical estimation. 

    Thus, while the FDP is not completely without disadvantages, considering the high level of security and accessibility it provides, it represents the preferred security mechanism in protecting confidential data existing in databases. The following section describes two fixed data perturbation techniques that are commonly used.         

FIXED DATA PERTURBATION TECHNIQUES

     To create a perturbed dataset using FDP, the numeric attribute values in the original dataset are altered by a random factor generated from a pre-determined distri­bution.  Perturba­tion is achieved by one of two techniques:  additive or multiplicative. 

     In the additive approach, a random error term is generated from a pre-specified, indepen­dent distribution and the perturbed value takes the form:

                         yi = xi + ei

where

            yi is the ith observation of the perturbed series,

            xi is the ith observation of the original series, and

            ei is a random variable with zero mean and a pre-specified variance.

 

Since E(e) = 0, the expected values of the means of the original and perturbed series are identical.

     In the multiplicative approach, perturbation takes the form

                        yi = xi * ei.

In this case, e is a random variable with a mean of 1 and a pre-specified variance.  The expected value of the mean of the PDB is again identical to that of the original data­set.
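     As a minimal sketch of the two techniques (not the procedure used later in the study, which draws both the dataset and e from Log-normal distributions), the Python fragment below perturbs an illustrative dataset additively with a zero-mean error and multiplicatively with a unit-mean error; all parameter values are assumptions made only for this example.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.lognormal(mean=0.0, sigma=0.5, size=1000)       # original dataset

    # Additive perturbation: e has mean 0 and a pre-specified variance.
    e_add = rng.normal(loc=0.0, scale=0.3, size=x.size)
    y_add = x + e_add

    # Multiplicative perturbation: e has mean 1 and a pre-specified variance
    # (a Log-normal e with underlying mean -sigma^2/2 has expectation 1).
    s = 0.3
    e_mul = rng.lognormal(mean=-0.5 * s * s, sigma=s, size=x.size)
    y_mul = x * e_mul

    # The expected means of both perturbed series equal the original mean.
    print(x.mean(), y_add.mean(), y_mul.mean())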

     There are some inherent differences between the additive (AP) and multiplicative (MP) methods of fixed data perturbation.  The primary distinction is that the perturbed values from the AP method are independent of the original dataset values while MP values are a function of the original dataset.  In other words, the expected level of perturbation resulting from the AP method would be the same for a value of $10,000 or $100,000, while MP results in values in proportion to the original series; a perturbation based on an original dataset value of $10,000 will yield a smaller change than one based on a value of $100,000.

     Both methods of FDP use a random variable e to mask the original values so that they will not be disclosed.  However, if the values of e are not carefully selected, the perturbed and original values may be so close that partial disclosure may still occur.  Hence, it is necessary to select the distribution (and parameters associated with that distribution) of e so that the perturbed dataset provides an adequate level of security.  In general, for a given distribution and given method of perturbation, the magnitude of e will be higher if the variance of e is greater.

     The selection of the distribution of e will also affect the level of security provided.  Muralidhar and Batra (1990) found that better security and accuracy for a Log-normal population were provided by a Log-normal perturbation.  This is consistent with the suggestion by Liew et al. (1985) that skewed populations should be perturbed by skewed distributions.  Hence, in this study, the distribution of e is assumed to be Log-normal.
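     One practical detail is choosing the parameters of a Log-normal e so that it has a mean of 1 and a desired standard deviation σe.  The helper below follows the standard Log-normal moment formulas (mean = exp(m + s²/2), variance = (exp(s²) - 1) * exp(2m + s²)); the target value of 0.5 is illustrative.

    import numpy as np

    def lognormal_e(target_sd, size, rng):
        # Solve the Log-normal moment equations for a mean of exactly 1:
        # s^2 = ln(1 + sd^2) and m = -s^2 / 2.
        s2 = np.log(1.0 + target_sd ** 2)
        return rng.lognormal(mean=-0.5 * s2, sigma=np.sqrt(s2), size=size)

    rng = np.random.default_rng(1)
    e = lognormal_e(0.5, 100_000, rng)        # sigma_e = 0.5 is illustrative
    print(e.mean(), e.std())                  # approximately 1.0 and 0.5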

     Since both methods of FDP rely on a random variable to mask the original values, the perturbed dataset will result in statistical measures that are different from the original dataset.  This difference represents the bias due to perturbation.   In general, the higher the variance of e, the higher the perturbation bias.  Thus, increasing the variance of e leads to increased security at the expense of accuracy (due to increased bias).  The security and bias resulting from perturbation can be illustrated by using an example dataset.

     Table 1[2] offers an example of how the two perturbation techniques can be applied.  The table includes a 'true' series of values drawn from a Log-normal population, sorted in ascending order, the dataset perturbed using the multiplicative (MP) and the additive (AP) methods, and the relative position of a given value.  The mean, standard deviation, and selected percentiles are provided for each series, as are the product moment and rank order correlations between the original and perturbed datasets.  Note that while the theoretical means for all three series are equal, in practice small differences can be anticipated.

--------------------------

Insert Table 1. about here

--------------------------

      The most attractive aspect of fixed data perturbation is that, while access to the original dataset is restricted, users may be allowed unrestricted access to the PDB without fear of exact disclosure.  As Table 1 shows, it is impossible to determine any of the original values exactly based only on the perturbed values.  As mentioned earlier, care must be taken to select e so that the values are appropriately perturbed.  Also note that the PDBs resulting from the additive and multiplicative methods are different.

     The degree to which the perturbed dataset resembles and maintains the characteristics of the original dataset is a primary concern.  Since the main purpose of an SDBS is to gather statistical information, a comparison of the most frequently used statistical measures is warranted.  A hypothesis test of equality of means indicates that the null hypothesis of equal means could not be rejected (p-value > 0.05).

     Table 1 also provides the different percentiles computed for the three datasets.  When contrasted with the original dataset, the MP dataset yielded values closer to the original series at three levels while the AP dataset yielded two outcomes which were closer to the original dataset.  Table 1 also provides correlations between the original and perturbed datasets.  The correlation between the original series and the MP dataset is higher than the correlation between the original series and the AP dataset.  Figure 1 provides the frequency distribution of the datasets to illustrate changes in form.  The MP dataset matches the form of the original series more closely than the AP dataset does.

--------------------------

Insert Figure 1. about here

--------------------------

     A further analysis of Table 1 also indicates that the MP dataset preserves the sign of the values in the original dataset.  In other words, if all the values in the original dataset are positive, then the corresponding values in the MP dataset will also be positive.  With the AP method, by contrast, even if all the values in the original dataset are positive, the resulting dataset could have negative values.  For example, the 10th observation in the original series, which has a positive value (6,650.19), when perturbed using AP, results in a negative value (-14,814.57).  In the example dataset, this did not present a problem since the dataset contained both negative and positive values.  However, if the original series were positive (a series such as salary or age), then negative perturbed values would not be acceptable.  In such cases, it is necessary to modify the AP method by re-selecting the perturbation value so that the resulting perturbed value is also positive.

     While implementing a procedure to ensure a positive PDB is not a major problem, the effect of such selective replacement of the perturbation values on the PDB is unknown.  Hence, it is necessary to analyze the effect of the selective replacement on the statistical characteristics of the resulting PDB.  For further analysis presented in this study, both cases will be considered, namely, datasets where both positive and negative values are allowed (general case) and datasets where only positive values are allowed (truncated case).  Note that no modifications are necessary to implement the multiplicative method in either case. 
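     The re-selection step for the truncated additive case is straightforward to implement; the sketch below simply redraws the error term for any observation whose perturbed value would fall below a lower limit (taken here as zero, an assumption appropriate for strictly positive attributes such as salary).

    import numpy as np

    def perturb_additive_truncated(x, k, sigma_e, rng, lower=0.0):
        # AP with a Log-normal e of mean 1: y = x + K * (e - 1).  Any value
        # falling below `lower` has its perturbation re-selected (AP/T).
        s2 = np.log(1.0 + sigma_e ** 2)

        def draw(n):
            return rng.lognormal(-0.5 * s2, np.sqrt(s2), n)

        y = x + k * (draw(x.size) - 1.0)
        bad = y < lower
        while bad.any():
            y[bad] = x[bad] + k * (draw(int(bad.sum())) - 1.0)
            bad = y < lower
        return y

    rng = np.random.default_rng(2)
    x = rng.lognormal(0.0, 0.5, 1000)                 # strictly positive data
    y = perturb_additive_truncated(x, x.mean(), 0.5, rng)
    print(y.min() >= 0.0, y.mean() / x.mean())        # True; mean biased upward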

       The results of the analysis based on one single population of a specified size (presented in Table 1) indicate no overwhelming support for either method, but seem to favor the MP method.  Moreover, a single dataset is but one realization from an infinite set of possible datasets from that population.  The results derived from one dataset cannot be generalized to all datasets or to different populations.  Further, the example dataset also does not provide a clear pattern on the behavior of the bias resulting from perturbation.

     Thus, an investigation of the multiplicative and additive FDP techniques, comparing the security provided and the resulting bias in statistical estimation, is required.  Such an investigation must also address a broad spectrum of possible datasets that occur in organizational databases so that the results can be generalized to all database populations.  The following section presents an analysis of additive and multiplicative FDP considering several statistical measures and a wide range of possible datasets.

THE SECURITY VERSUS ACCURACY TRADE-OFF FOR FDP TECHNIQUES

     Security provided by the FDP techniques will be assessed by using a direct measure of perturbation, the Mean Absolute Perturbation.  In order to assess accuracy, or the level of bias, several statistical measures that are commonly used in managerial decision making were used.  These include: (a) the form of the distribution, (b) the mean, (c) the variance, (d) different percentiles (5th, 25th, 50th, 75th, and 95th), and (e) product moment and rank order correlations.  While prior studies have selectively used some of these measures (Liew et al., 1985; Traub et al., 1984), none have used all these measures in the same analysis (Adam and Wortman, 1989).

Measure of Security

     The level of security offered by a specific perturbation technique refers to the degree to which disclosure of sensitive information is prevented.  While the FDP does not allow for exact disclosure since each value in the dataset is perturbed, partial disclosure may still occur if the perturbed value is close to the original value.  The greater the difference between the original and perturbed values, the greater the protection provided.  Traditionally, the standard deviation of the perturbation distribution has been used as a measure of security.  However, Muralidhar and Batra (1990) found that for a given standard deviation, different levels of security may be provided, and suggest that the level of security be directly derived.

            Mathematically, the absolute difference between the original and perturbed values can be expressed as

                        |yi - xi| = |ei|. 

E(|e|) therefore represents the true measure of security and is defined as the Mean Absolute Perturbation (MAP).

     In comparing the MAP of the multiplicative (MP) and the general case of the additive method (AP/G), it is possible to analytically derive the following relationship:

 If the parameters and distribution of e are the same for both the MP and AP/G methods, then the MAP of each will be equal if e in the additive case is scaled by the mean of the original database[3].

 Consider a Log-normal perturbation distribution e with shape and scale parameters selected to yield a mean of 1 and a standard deviation of σe, and shift parameter = 0.0. For MP, MAP can be analytically derived as

                        MAP = E(|Y ‑ X|) = E(|(X * e) ‑ X|) = E(|X * (e ‑ 1)|)

Since the perturbation distribution e is independent of X, the MAP can be rewritten as

                        MAP = E(|X|) * E(|e ‑ 1|) = μX * E(|e ‑ 1|) 

            where μX is the mean of the original database.

    For the  AP/G, the perturbation distribution, e, must be shifted (shift parameter = -1) such that it will have a mean of 0.0.  The perturbed distribution thus becomes

                        Y = X + (K * (e ‑ 1))

where K is the scaling factor that places the perturbed values on the same scale as the original values.  The MAP in this case can be defined as

                         MAP = E(|Y - X|) = E(|X + (K * (e - 1)) - X|)

                                 = E(|K * (e - 1)|) = K * E(|e - 1|)

 If K is selected as μX, then the MAPs of the AP and MP approaches are equal [MAP = μX * E(|e - 1|)].
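     This scaling result is easy to check numerically.  The sketch below, using illustrative parameter values, compares the empirical mean absolute perturbation of the MP method with that of AP/G when K is set to the sample mean of the original dataset.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.lognormal(0.0, 0.7, 200_000)              # original dataset
    s2 = np.log(1.0 + 0.5 ** 2)                       # sigma_e = 0.5, mean 1
    e = rng.lognormal(-0.5 * s2, np.sqrt(s2), x.size)

    y_mp = x * e                                      # multiplicative
    y_ap = x + x.mean() * (e - 1.0)                   # additive with K = mean of X

    # Both expectations equal mu_X * E|e - 1|, so the two estimates converge
    # for large samples.
    print(np.abs(y_mp - x).mean(), np.abs(y_ap - x).mean())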

     The truncated case of the additive method (AP/T), where perturbed values falling below some lower limit are disallowed, relies on the truncated three-parameter Log-normal distribution.  It has previously been shown that the estimation of properties of such a distribution is not analytically possible (Johnson and Kotz, 1970).  Hence, MAP and the statistical properties must be examined empirically in this case.

Statistical Measures

A.     Distribution of the Perturbed Dataset

     One key indicator of how well a PDB represents the original dataset is the degree to which the two distributions correspond.  Such a property is desirable since some statistical analyses (such as analysis of variance) require that the data be transformed (by taking the log of the values) so that parametric assumptions will be satisfied.  Such transformations become easier if the exact form of the distribution is known.  The distribution of the MP dataset is Log-normal since the product of two independent Log-normal random variables is also Log-normal (Johnson and Kotz, 1970).  Thus, the MP dataset retains the same distribution as the original dataset, and the logarithmic transformation applied to the original dataset will also hold for the MP dataset.

     It is not possible to derive such a result for the AP method since the form of the distribution resulting from the addition of a Log-normal dataset and a Log-normal perturbation distribution is unknown. The distribution of the sum of two random variables is known only when the random variables are described by a few select statistical distributions such as the normal and special cases of the Gamma distribution (Johnson and Kotz, 1970).
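     The distributional claim for the MP method can also be checked empirically, assuming SciPy is available: since ln(Y) = ln(X) + ln(e) is the sum of two independent normal variables, a normality test applied to the log of the MP dataset should typically fail to reject.  All parameters below are illustrative.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.lognormal(0.0, 0.5, 5_000)
    s2 = np.log(1.0 + 0.3 ** 2)
    e = rng.lognormal(-0.5 * s2, np.sqrt(s2), x.size)

    y_mp = x * e
    # D'Agostino's K^2 test on ln(Y): a large p-value is consistent with
    # the MP dataset remaining Log-normal.
    print(stats.normaltest(np.log(y_mp)).pvalue)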

B.     Bias in Estimating the Mean

     Neither the AP/G nor the MP approach leads to bias in estimating the mean since the expected values of the perturbation variable are 0 and 1 for the AP and MP methods, respectively.  By contrast, the AP/T will have a positive bias in estimating the mean (i.e., the mean of the perturbed distribution will be higher than the original mean) since negative values will be replaced by positive values.  The exact bias cannot be analytically derived since the moments of the AP/T distribution and the resulting perturbed distribution are unknown.

C.     Bias in Estimating the Variance

     The perturbed datasets resulting from both the additive and multiplicative methods of data perturbation have variances which are higher than that of the original dataset.  It can be analytically shown that

 If the MAP of a dataset perturbed by the MP and AP/G is the same, then the variance of the perturbed dataset created using the MP method will be higher than the variance of the perturbed dataset created using the AP/G method.

     Consider a dataset with mean and variance of μx and σx², respectively.  Assume the dataset is perturbed using a Log-normal random variable e with μe = 1 and variance σe².  The AP dataset can be defined as

                        Y = X + μx(e - 1).

The variance of the resulting PDB (σy²) is equal to

                        σx² + (μx² σe²).

If the same dataset is now perturbed by MP using the same perturbation variable e, then the variance of the resulting perturbed distribution can be shown to be

                        σx² + (μx² σe²) + (σx² σe²).

Since neither σx² nor σe² can be zero, the variance of the perturbed dataset using the MP method will be higher than that of the AP/G method.

     The analysis also indicates that the variances of the datasets resulting from the MP method and the AP/G case can be analytically derived.  For the AP/T case, the exclusion of some perturbation values results in the variance of the perturbed dataset being lower than even that of the AP/G method, but its exact value has to be determined through empirical investigation.
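     Both variance expressions can be confirmed against a simulated dataset.  The sketch below compares the empirical variances of the AP/G and MP perturbed datasets with the analytic values σx² + μx²σe² and σx² + μx²σe² + σx²σe²; the population and perturbation parameters are illustrative.

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.lognormal(0.0, 0.5, 500_000)
    sigma_e = 0.4
    s2 = np.log(1.0 + sigma_e ** 2)
    e = rng.lognormal(-0.5 * s2, np.sqrt(s2), x.size)

    y_ap = x + x.mean() * (e - 1.0)                   # AP/G
    y_mp = x * e                                      # MP

    mu2, var_x, var_e = x.mean() ** 2, x.var(), sigma_e ** 2
    print(y_ap.var(), var_x + mu2 * var_e)                    # AP/G: empirical vs analytic
    print(y_mp.var(), var_x + mu2 * var_e + var_x * var_e)    # MP: empirical vs analytic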

D.     Bias in Estimating Percentiles

     Percentiles are often used by decision makers to summarize or describe the properties of large batches of quantitative data.  The 5th, 25th, 50th, 75th, and 95th percentiles were chosen because they represent the entire range of the distribution.  Bias in the percentiles occurs as a result of changes in the form and variance of the perturbed distributions.  For the MP method, since the distribution of the PDB is known, each of the percentiles can be determined analytically.  For both the AP/G and AP/T cases of the AP method, the form of the resulting PDB is unknown and it is necessary to estimate the percentiles of the perturbed dataset using simulation.

E.     Product Moment Correlation

     The relationship between different attributes residing in a database (such as advertising and sales) are often analyzed using correlation.  Perturbing the attributes will reduce the strength of such relationships.  Since it is difficult to study the relationship between the dataset being perturbed and every other attribute, Pearson product moment correlation between the original dataset and the perturbed dataset is used as a surrogate measure of the degree to which such relationships will be affected.  A perturbed dataset that has a higher correlation with the original dataset will better preserve the relationship with other attributes.

     It can be shown that for any dataset,

 If the MAP of a dataset perturbed by the MP and AP/G methods are the same, then the product moment correlation between the original and perturbed datasets is higher for the AP/G method than for the MP method.

The correlation between the original and perturbed datasets can be defined as

                        r = Cov(X,Y)/(σx σy).

In the AP/G case,

                        Cov(X,Y) = E((X - μx)(Y - μy)).

Since X and e are independent, μx is a constant, μx = μy, and E(e) = 1, the expression can be simplified as

                         Cov(X,Y) = E(X²) - 2μxE(X) + μx²

                                         = E(X²) - 2μx² + μx²

                                         = E(X²) - μx² = σx².

     For the MP case,

                         Cov(X,Y) = E((X - μx)(Xe - μx))

                                        = E(X²e) - μxE(X) - μxE(Xe) + μx²

                                        = E(X²)E(e) - 2μxE(X) + μx²

                                        = E(X²) - μx² = σx².

Since Cov(X,Y) is the same for both perturbation methods and the variance of the MP method is higher than that of the AP/G method, the correlation between the AP/G dataset and the original dataset will be higher than that between the MP dataset and the original dataset.  Such analytic derivations are not possible for the AP/T case, and must therefore be examined empirically.
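     Since Cov(X,Y) = σx² for both methods, each correlation reduces to σx/σy, and the only difference comes from the larger σy of the MP dataset.  The sketch below compares the empirical correlations with this ratio under illustrative parameters.

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.lognormal(0.0, 0.6, 500_000)
    s2 = np.log(1.0 + 0.5 ** 2)
    e = rng.lognormal(-0.5 * s2, np.sqrt(s2), x.size)

    y_ap = x + x.mean() * (e - 1.0)
    y_mp = x * e

    # r = Cov(X, Y) / (sigma_x * sigma_y) = sigma_x / sigma_y for both methods.
    print(np.corrcoef(x, y_ap)[0, 1], x.std() / y_ap.std())
    print(np.corrcoef(x, y_mp)[0, 1], x.std() / y_mp.std())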

F.     Rank Order Correlation

     The final statistical measure considered in this study is rank order correlation.  This measure reflects the change in ordinality of the original dataset and also provides a measure of accuracy for such queries as counts within specific levels. As a holistic measure of distortion in ordinality, Spearman's rank-order correlation can be applied, but cannot be analytically derived (Daniel, 1978).

Experimental Analysis

     There are some aspects of both security and bias which cannot be derived through analytical means.  Consequently, extensive Monte-Carlo simulations were conducted in order to compare the perturbation approaches.  As a basis of comparison, four population distributions, represented by four different levels of variance (σX), were considered.  The form of these populations ranged from approximately normal to heavily skewed.  Each original dataset was created with 1,000 observations using IMSL (1989).  The mean of each population was specified as 1.0, and the location parameter was specified as 0.0.  Note that the use of a specified mean and location does not result in any loss of generality since a change in scale (e.g., changing the mean from 1.0 to 1,000.0) or location (e.g., changing the lower limit from 0.0 to -1.0) would result in corresponding changes in the scale and location of the PDBs as well.

     For each original dataset, four different Log-normal perturbation distributions (represented by σe) were created for each of the three perturbation approaches (AP/G, AP/T, and MP).  Each perturbation distribution provides a specified level of security measured by MAP.  Thus, a total of 16 different population/perturbation combinations were simulated.  Each population/perturbation combination was replicated 1000 times, and the statistical measures which could not be analytically derived were determined as the average of the 1000 replications[4].
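     A compact version of this simulation design is sketched below, with numpy standing in for the IMSL routines used in the original study.  The particular σx and σe values, and the measure averaged here (rank order correlation for the MP method), are illustrative rather than a reproduction of the sixteen combinations reported in the tables.

    import numpy as np

    def simulate_mp_rank_corr(sigma_x, sigma_e, n=1_000, reps=1_000, seed=0):
        # Replicate one population/perturbation combination and average a
        # measure that cannot be derived analytically.
        rng = np.random.default_rng(seed)
        s2x, s2e = np.log(1 + sigma_x ** 2), np.log(1 + sigma_e ** 2)
        corrs = []
        for _ in range(reps):
            x = rng.lognormal(-0.5 * s2x, np.sqrt(s2x), n)   # mean 1, shift 0
            e = rng.lognormal(-0.5 * s2e, np.sqrt(s2e), n)   # mean 1
            y = x * e                                        # MP perturbation
            rx, ry = x.argsort().argsort(), y.argsort().argsort()
            corrs.append(np.corrcoef(rx, ry)[0, 1])          # Spearman via ranks
        return float(np.mean(corrs))

    print(simulate_mp_rank_corr(sigma_x=0.5, sigma_e=0.3))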

      The following section discusses the results derived using both the analytical and simulation procedures for each of the security and bias measures.

RESULTS

     Table 2 provides the analytical and simulation results of the study for the MP and AP/G methods.  Note that for any distribution/perturbation combination the resultant MAPs are equal.  This allows for a comparison of the bias resulting from the two approaches at the same level of security.  As proved earlier, the variance of the AP/G method, while higher than that of the original series, is closer to the original variance than that of the MP method.  The same holds true for product moment correlation, with AP/G providing higher correlation with the original series than the MP method.  The differences between the AP/G and MP methods for both variance and product moment correlation are more pronounced for populations with higher levels of skewness.  In general, the differences between the two methods are more pronounced (for all statistical measures) for highly skewed populations and/or high levels of perturbation.

--------------------------

Insert Table 2. about here

--------------------------

     The comparison of the percentiles derived from the MP and AP/G methods indicates that neither method dominates.  The MP method consistently outperforms the AP/G method at the 5th and 75th percentiles.  The AP/G method, in most cases, provides estimates closer to the original series at the 50th and 95th percentiles.  At the 25th percentile, the MP method provides closer estimates for 9 of the 16 combinations, the AP/G method for 5, and 2 resulted in ties.  Of 80 different cases involving percentiles, the MP method provides better estimates in 44 cases, the AP/G method in 32 cases, and 4 cases were tied.  The MP method performs worst at the 95th percentile, where in 14 of the 16 cases the AP/G method provides better estimates than the MP method.  If the 95th percentile is not considered, then the MP method provides better or equivalent estimates, compared to the AP/G method, in 46 out of 64 cases (72%).

     The rank order correlation results were unexpected.  Since the AP/G method provides better product moment correlation than the MP method, the same was also expected for rank order correlation.  The results show the reverse.  In 15 of the 16 combinations, the MP method preserves the ordinality of the data better than the AP/G method.  While this behavior can be attributed to the fact that the perturbation resulting from the MP method is a function of the original value, the extent to which the MP method dominates the AP/G method for this measure was surprising given that the AP/G method provides better product moment correlation.

     The results of the comparison between the MP and AP/T methods are provided in Table 3.  All the AP/T statistical measures were determined through simulation.  One of the major problems with the AP/T method is that, unlike the MP and AP/G methods, it results in a bias in estimating the mean.  In the worst case, the mean of the PDB created using AP/T is 24% higher than that of the original series.  The finding that the AP/T provides such a high bias in estimating the most common statistical measure is a cause for concern.  The bias in estimating the mean of the PDB using the AP/T method is very small for populations with low levels of skewness.

--------------------------

Insert Table 3. about here

--------------------------

     The MAP provided by the AP/T method is also lower than that of the MP method, in most cases.  This implies that the security provided by the AP/T method is lower.  This aspect also creates a problem in comparing the statistical measures derived from the two methods.  A direct comparison was possible between MP and AP/G methods since both provide the same level of MAP, and bias comparisons could be made for a given level of security.  This is not the case in comparing the MP and AP/T methods.  As such, the discussion on comparing the bias resulting from the two methods must be tempered by the fact that any increased effectiveness provided by the AP/T method comes at the cost of lowered security.

     The variance of the AP/T method is always lower than that of the MP and AP/G methods.  This is to be expected since the large negative perturbation values, which would otherwise inflate the spread of the perturbed dataset, are replaced.  Due to the lower variance, the product moment correlation between the AP/T dataset and the original dataset is higher than that between the MP and original datasets.  The MP method, however, provides better rank order correlation than the AP/T method.

     The results of the comparison between MP and AP/T methods for percentile estimates follow the same pattern as the MP and AP/G comparison.  The MP provides better estimates at the 5th and 75th percentiles, while the AP/T method provides less biased estimates at the 25th and 95th percentiles.  The performance of the AP/T method is better at the 50th percentile as well.  Of the 80 percentile estimates considered, the MP method provides less biased estimates in 40 cases, the AP/T method provides less biased estimates in 38 cases, and 2 cases resulted in ties.

     Thus, if the decision to select the appropriate method of fixed data perturbation is based on the number of times one method is better than another, the results favor the MP method.  In addition, the MP method provides several advantages:

(1)   It retains the same distribution (Log-normal) as the original series.

(2)   The perturbation is relative, i.e., large values in the original dataset are perturbed more than smaller values.

(3)   The MP method retains ordinality better than the AP method.  Hence, it is less likely (in a dataset such as salary) that a superior/subordinate salary ranking is reversed using the MP method.

(4)   All the statistical measures of the PDB created using the MP method can be analytically derived; this is not the case for the AP method.

(5)   The MP method guarantees that the values in the PDB have the same lower limit as the original dataset, thereby eliminating the need for any "truncation".

(6)   In cases of truncation, the mean of the PDB from the AP method is biased, whereas the MP method provides an unbiased estimate.

The advantage of the AP method is that it provides a variance closer to the true variance than the MP method.  Consequently, the correlation between the original and AP datasets is higher than that between the original and MP datasets.  Overall, considering all aspects of the trade-off between security and accuracy, the results of this study provide strong evidence to suggest that the multiplicative method of fixed data perturbation is better than the additive method.

     The results provided in Tables 2 and 3 can also be used to select the level of perturbation for a specific application. Consider, for example, the selection of a distribution best suited to perturb a Log-normal dataset with mean = 1000 and standard deviation = 500.  This dataset is identical to the dataset described in Table 2 under σx = 0.5, scaled by a factor of 1000. Based on the confidentiality of the data, the level of security (MAP) could be selected.  The resulting loss in accuracy can also be estimated.

     From a practical perspective, it should be noted that the results presented in Tables 2 and 3 were either analytically derived or based on a large number of simulations and represent the expected values of the different parameters.  For a specific case, as illustrated in Table 1, the actual values will differ slightly from the expected values.  In general, the actual values should approach the expected values as the size of the original dataset increases.

CONCLUSIONS

     Organizational databases contain extensive information accessible to a large number of users.  The incidence of computer abuse where confidential information has been divulged has also increased, requiring much more managerial control for protecting databases.  A recent study indicates that 33% of the companies surveyed had little or no security measures (Datamation, 1993).  The main reason for the lack of security is the paucity of information regarding security mechanisms.  The objective of this paper was to investigate the effectiveness of the available techniques and provide guidelines for implementation to protect confidential, numerical attributes in organizational databases. 

     An IS manager who is responsible for providing appropriate security measures must also allow maximum accessibility and accuracy.  These objectives, however, may conflict: increasing accessibility would decrease security, and increasing security would decrease accuracy.  Hence, it is necessary to resolve the conflicting multiple objectives in order to select the appropriate security mechanism.  The majority of prior research conducted on this topic is devoted to general application issues and to algorithmic refinement.  This paper attempts to provide new insights into this issue from a decision making and practical perspective by investigating the trade-off between accessibility, security, and accuracy for the type of data most often contained in corporate databases.

     Based on an analysis of different security mechanisms, this study recommends the use of the fixed data perturbation approach.  This approach maximizes accessibility by providing complete access to the perturbed database and maximizes security by never revealing the true value.  The fixed data perturbation approach is also easy to implement.   An analysis comparing the two fixed data perturbation techniques indicates that multiplicative perturbation presents the better choice.  The perturbed dataset resulting from the multiplicative method retains the characteristics of the original dataset better than the additive method and hence provides a higher level of accuracy.   Thus, considering all aspects of organizational database management, multiplicative fixed data perturbation offers the most promise for success[5].

 

REFERENCES

Achugbue, J.O., and Chin, F.Y. "The effectiveness of output modification by rounding for protection of statistical databases", INFOR (17:3), August 1979, pp. 209‑218.

Adam, N.R., and Jones, D.H. "Security of Statistical Databases with an Output Perturbation Technique", Journal of Management Information Systems (6:1), Summer 1989, pp. 101‑110.

Adam, N.R., and Wortmann, J.C. "Security‑Control Methods for Statistical Databases: A Comparative Study", ACM Computing Surveys (21:4), December 1989, pp. 515‑556.

Aitchison, J., and Brown, J.A.C. The Lognormal Distribution, Cambridge, MA: University Press, 1957.

Beck, L.L. "A security mechanism for statistical databases", ACM Transactions on Database Systems (5:3), September 1980, pp. 316‑338.

Brancheau, J., and Wetherbe, J.C. "Key Issues in Information Systems ‑ 1986", MIS Quarterly (11:1), 1987, pp. 23‑45.

Brown, R.G. Decision Rules for Inventory Management, New York: Holt, Rinehardt and Winston, 1967.

Charnes, A., Cooper, W.W., Devoe, J.K., Learner, D.B., and Reinecke, W. "A Goal‑programming model for Media Planning", Management Science (14:8), 1968, pp. B423‑B430.

Chin, F.Y., and Ozsoyoglu, G. "Statistical database design", ACM Transactions on Database Systems (6:1), March 1981, pp. 113‑139.

Crow, E.L., and Shimizu, K. Lognormal Distributions: Theory and Practice, New York: Marcel Dekker, Inc., 1988.

Daniel, W.W, Applied Non Parametric Statistics, Boston, MA: Houghton Mifflin Company, 1978.

Datamation, "Computer Security Problems?  You're not alone", May 15, 1993, p. 24.

Denning, D.E., Denning, P.J., and Schwartz, M.D. "The tracker: A threat to statistical database security", ACM Transactions on Database Systems (4:1), 1979, pp. 76‑96.

Denning, D.E. "Secure statistical databases with random sample queries", ACM Transactions on Database Systems (5:3), September 1980, pp. 291‑315.

Denning, D.E. Cryptography and Data Security, Reading, MA: Addison‑Wesley, 1982.

Denning, D.E., and Schlorer, J. "Inference Control for Statistical Databases", Computer (16:7), July 1983, pp. 69‑82.

Dickson, G. W., Leitheiser, R.L., Wetherbe, J.C., and Nechis, M. "Key information systems issues for the 80's", MIS Quarterly (8:3), 1984, pp. 135‑159.

Dobkin, D., Jones, A.K., and Lipton, R.J. "Secure databases: Protection against user influence" ACM Transactions on Database Systems (4:1), March 1979, pp. 97‑106.

Easton, G. "Stochastic Models of Industrial Buying Behavior", Omega (8:1), 1980, pp. 63‑69.

Fellegi, I.P. "On the question of statistical confidentiality", Journal of the American Statistical Society (67:3), March 1972, pp. 7‑18.

Fellegi, I.P., and Phillips, J.L. "Statistical confidentiality: Some theory and applications to data dissemination", American Economic Society Measures (3:2), April 1979, pp. 399‑409.

Goodhue, D.L., and Straub, D.W. "Security Concerns for Systems Users: A proposed study of user perceptions by the adequacy of security measures", Proceedings of the 22nd Annual Hawaii International Conference on Systems Science, Kona, HA, January 1988.

Hartlog, C., and Herbert, M. "1985 Opinion Survey of MIS Managers: Key issues", MIS Quarterly (10:4), 1986, pp. 351‑361.

Haq, M.L. "Insuring individual's privacy from statistical database users", Proceedings of the National Computer Conference, 1975, Montvale, NJ. AFIPS Press, Arlington, VA, pp. 941‑946.

Herron, D.P. "Profit‑Oriented Techniques for Managing Independent Demand Inventories", Production and Inventory Management (15:1), 1974, pp. 57‑74.

Hoffer, J.A., and Straub, D.W. "The 9 to 5 Underground: Are you policing Computer Crimes?", Sloan Management Review (30:4), 1989, pp. 35‑44.

Hoffman, L.J. Modern Methods for Computer Security and Privacy, Englewood Cliffs, NJ: Prentice‑Hall, 1977.

IMSL STAT/Library - Fortran Subroutines for Statistical Analysis (Vol. 3), Problem Solving Software Systems, Houston, TX, 1989.

Jain, L.R. "On fitting the three‑parameter log‑normal distribution to consumer expenditure data", Sankhya (39:1), 1977, pp. 61‑73.

Johnson, N.L., and Kotz, S. Distributions in Statistics: Continuous Univariate Distributions ‑ 1,  New York: John Wiley, 1970.

Laudon, K.C. Dossier Society: Value choices in the design of National Information Systems,  New York: Columbia University Press, 1986.

Laudon, K.C., and Laudon, J.P. Information Systems Management: A Contemporary Perspective (2nd. Ed), New York: Macmillan Publishing Co., 1991.

Lawrence, R.J. "The log‑normal distribution of buying frequency rates", Journal of Marketing Research (17:2), 1980, pp. 212‑220.

Lefons, D., Silvestri, A., and Tangorra, F. "An Analytic approach to statistical databases", Proceedings of the 9th Conference on Very Large Databases, Florence, Italy, 1983, pp. 260‑273.

Liew, C.K., Choi, U.J., and Liew, C.J. "A Data Distortion by Probability Distribution", ACM Transactions on Database systems (10:3), September 1985, pp. 395‑411.

Mason, R. "Four Ethical Issues of the Information Age", MIS Quarterly (10:1), March 1986, pp. 5‑12.

Matloff, N.E. "Another look at the use of noise addition for database security", Proceedings of the IEEE Symposium on Security and Privacy, 1986, pp. 173‑180.

Miller, A.R. The Assault on Privacy‑Computers, Data Banks and Dossiers, Ann Arbor, MI: University of Michigan Press, 1971.

Muralidhar, K., and Batra, D. "An Investigation of the Effectiveness of Statistical Distributions for Additive Fixed Data Perturbation", Florida International University, Working Paper #90‑01, 1990.

Neter, J., and Loebbecke, J.K. "Behavior of major statistical estimators in sampling accounting populations: An empirical study", AICPA, 1975.

O'Neill, B., and Wells, W.T. "Some recent results in lognormal parameter estimation using grouped and ungrouped data", Journal of the American Statistical Association (67:1), 1972, pp. 76‑79.

Ozsoyoglu, G., and Chin, F.Y. "Enhancing the security of statistical databases with a question‑answering system and a kernel design", IEEE Transactions in Software Engineering (8:3), 1982, pp. 223‑234.

Ozsoyoglu, G., and Ozsoyoglu, M. "Update handling techniques in statistical databases", Proceedings of the 1st LBL Workshop on Statistical Database Management, Berkeley, CA, December 1981, pp. 75‑83.

Ozsoyoglu, G., and Su, T.A. "Rounding and Inference control in conceptual models for statistical databases", Proceedings of the IEEE Symposium on Security and Privacy, 1985, pp. 160‑173.

Palley, M.A. "Security of statistical database compromise through attribute correlational modeling", Proceedings of the IEEE Conference on Database Engineering, 1986, pp. 67‑74.

Palley, M.A., and Simonoff, J.S. "The use of regression methodology for confidential information in statistical databases", ACM Transactions in Database Systems (12:4), December 1987, pp. 593‑608.

Parker, D.B. Computer Security Management, Reston, VA: Reston Press, 1981.

Quandt, R.E. "On the size distributions of firms", American Economic Review (56:3), 1966, pp. 416‑432.

Reiss, S.P. "Practical data swapping: The first steps", ACM Transactions in Database Systems (9:1), March 1984, pp. 20‑37.

Schlorer, J. "Security of Statistical Databases", ACM Transactions on Database Systems (6:1), March 1981, pp. 95‑112.

Simon, H.A., and Bonini, C.P. "The size and Distribution of Firms", American Economic Review (48:4), 1958, pp. 607‑617.

Steindl, J. Random Processes and the Growth of Firms, London: Griffen Publishing, 1965.

Straub, D.W., and Hoffer, J.A. "Computer Abuse and Computer Security: An Empirical study of Contemporary Information System Security", IRMIS (Institute for Research on the Management of Information Systems, Indiana University School of Business, Bloomington, IN), Working Paper #W801, 1987.

Straub, D.W. "Organizational Structuring of the Computer Security Function", Computers and Security (7:2), 1988, pp. 1‑11.

Straub, D.W. "Effective IS Security: An Empirical Study", Information Systems Research (1:3), September 1990, pp. 255‑276.

Thatcher, A.R. "The distribution of earnings of employees in Great Britain", Journal of the Royal Statistical Society, (131:1), 1968, pp. 133‑170.

Traub, J.F., Yemini, Y., and Wozniakowski, H. "The statistical security of a Database", ACM Transactions on Database Systems (9:4), December 1984, pp. 672‑679.

U.S. Department of Health, Education and Welfare Advisory Committee on Automated Personal Data Systems. Records, Computers, and the Rights of Citizens, Cambridge, MA: Massachusetts Institute of Technology Press, 1973.

Wong, K. "Computer Crime ‑ Risk Management and Computer Security", Computers and Security (4:4), 1985, pp. 287‑295.

Wysocki, R.K., and Young, J. Information Systems: Management Principles in Action, New York: John Wiley and Sons, 1990.


 

    [1] For a complete review of the genesis, moments, and other characteristics of the Log-normal distribution, refer to Johnson and Kotz (1970).

    [2] While the perturbed datasets presented are based on a single realization, they demonstrate  the concepts discussed. A more detailed analysis is presented in later sections.

    [3]  The relationships derived for MAP, variance, and product moment correlation are not distribution specific and can be applied to any distribution of X or e.

    [4]  The number of replications was selected as 1000 after preliminary investigations revealed that the estimates would be within ±1% of the true parameter value.

    [5]  The authors wish to thank the three anonymous reviewers and the associate editor for their helpful comments and suggestions.

 
