
High Dimensional Data Mining in
Complex Manufacturing Processes

David Forrest

To be defended Wednesday, 14 November 2001, 10:00-12:00
in Olsson Hall, Room 111A
Revised:

1  Introduction

To control and improve chip production, a semiconductor manufacturer may seek to use the manufacturing and process data already recorded during production to understand the system more fully and to identify avenues for improvement. Semiconductor manufacturers collect a large amount of data in terms of storage space and number of variables, but a small amount in terms of the number of coherent observations. While any particular operation may have a number of observations, the large number of monitored variables produces effects similar to those of short-run manufacturing processes: insufficient degrees of freedom to model the process reliably. This work will produce a modeling methodology for managing hierarchical manufacturing data and will seek to produce useful models for semiconductor manufacturing and for complex manufacturing systems in general.

Smaller lot sizes and more flexible manufacturing processes, along with increased process complexity, combine to produce short-run processes. As more automated measuring and recording equipment enters the manufacturing process, a difficulty with the dimensionality of the problem emerges: runs shorter than the dimensionality of the problem. A high dimensional manufacturing process can have fewer distinct observations $n$ than process variables $p$. In prior work with a semiconductor manufacturer, we established that direct models of the form $Y_{yield} = b X_{process}$ can be ill-defined because of the dimensions of the data matrix $X_{\{n \times p\}}$, where $n \ll p$. These conditions lead to instability in the parameter estimates $b$ of a linear model and to singularity of $\mathrm{cov}(X)$, yet they do occur in complex manufacturing processes. Because the degrees of freedom in the system are limited by the number of observations, additional constraints must be placed upon the overall model. Assuming a hierarchical structure for the process, i.e. that the outputs of the system are functions of certain key parameters, which are in turn functions of lower level operations in the production process, may constrain the models into estimable and testable problems.
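The singularity described above can be seen on synthetic data. The following is a minimal sketch, assuming NumPy and hypothetical sizes n = 20 and p = 100 (not the plant data): the sample covariance of the p variables has rank at most n - 1, and a least-squares fit reproduces a pure-noise response exactly, which is the instability in b that motivates additional constraints.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # hypothetical sizes with n << p
X = rng.normal(size=(n, p))         # synthetic "process" matrix
y = rng.normal(size=n)              # pure-noise "yield", unrelated to X

# Sample covariance of the p variables: a p x p matrix of rank at most n - 1.
cov = np.cov(X, rowvar=False)
cov_rank = np.linalg.matrix_rank(cov)

# Minimum-norm least-squares coefficients; with n < p they fit the noise exactly.
b, _, x_rank, _ = np.linalg.lstsq(X, y, rcond=None)
max_residual = float(np.max(np.abs(X @ b - y)))

print("rank of cov(X):", cov_rank, "out of", p)
print("rank of X:", x_rank)
print("largest residual:", max_residual)
```

A perfect fit to noise is not evidence of a relationship; it only shows that the unconstrained model has more free parameters than observations.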

High dimensional data analysis requires effective visualization methods, since traditional methods such as run plots and scatter plots do not scale well to high dimensional systems. Using methods from clustering to sort the variables and observations can aid in high dimensional visualization. A certain manufacturer seeks a ``Golden Signature'' program which intends to identify production parameters from high quality lots and estimate the effects of deviations from these ideal parameters. This ``Golden Signature'' program requires a comprehensive model relating the many low level process variables to the high level yield variables.

Although difficult, this project is an ideal application of systems engineering because the system is complex and is managed by different groups of people. Semiconductor manufacturing is an extreme case of a complicated manufacturing system, with issues of discrete part manufacturing and of aggregations and disaggregations of data in time, production lots, and production processes. The interactions between production, engineering, information technology, and management indicate a need for an interdisciplinary method integrating the process. The general approach proposed here is to develop a methodology that uses the hierarchical structures inherent in the manufacturing and engineering processes to manage the complex models and high dimensionality of manufacturing processes.

2  Problem Statement

The extraction of meaningful inferences from the large mass of data generated by the semiconductor manufacturing process is problematic in two significant ways: first, the data is often not stored in a form amenable to analysis, and second, the large amount of data is often very small when compared to the complexity of a model representing the semiconductor manufacturing plant. This research focuses on these two problems: managing large (n << p) data for analysis, and using the data to discover interesting relationships.

2.1  Background of Semiconductor Manufacturing

Commercial semiconductor devices are manufactured in and on the surface of wafers from large ultra-pure crystals: thin disks, typically 200mm or 300mm in diameter. An area on the wafer containing a single discrete device or integrated circuit (IC) is called a chip or die. Depending on the dimensions of the wafer and the dies, several hundred chips are formed on a single wafer.

During fabrication, wafers are transported and processed in standard lots of twenty-five wafers each. Each lot undergoes hundreds of individual processing steps, in which different parts of the ICs are etched in thin layers of material grown or deposited on the working surface of the wafers. Each process step must be tightly controlled to ensure dimensional tolerances typically measured in nanometers.

Fabrication of a single lot requires approximately three months. Throughout, process settings, engineering parameters, and test data are logged for each fabrication tool at both the wafer and lot level, via a central computer network called a manufacturing execution system. With as many as 5000 wafer starts a week, process and engineering databases requiring hundreds of gigabytes of memory are normal.

Figure 1: Schematic of Simple Model of DS and QC data. Note that the number of rows (lots) differs in each data table, and that the intersection of all data rows for this 90 day sample of data is n = 221, p = 21710.

The data details the manufacturing processes involved in production. Analysis of this data differs from the data mining techniques developed for business sales information, market-basket analysis, image analysis, or spatial data because of the large number of variables, the interactions between sub-processes, and the relatively small number of observations. For example, a memory device involving 22 layers of semiconductor can involve 524 processing steps over 3 months with 21710 process variables. Figure 1 shows a sample of 90 days of lot level production data for one product, the misalignments between separate data tables, and that n = 221 << 21710 = p. Besides the vast amount of data, another challenge is that the measurements are commonly collected on different aggregations of parts, at the chip, wafer, batch, and lot levels. Since the measurements for a particular chip are spread out over time, are collected at different aggregation levels, and far outnumber the production yield observations, current data mining and analysis techniques such as clustering and linear regression modeling are inefficient and difficult to apply to semiconductor manufacturing environments.

The target of the proposed work is the system operational level: to extract knowledge from the data of sophisticated processes in order to improve operations, that is, to improve productivity, decrease ramp-up time, and identify and validate quality control parameters, which will ultimately increase yield. The anticipated research will focus on two areas: operational modeling of manufacturing data, and data representation and manipulation. I will develop a methodological approach to solving the complex modeling problems that arise in semiconductor manufacturing. I will also show how subsystems of the manufacturing process could be combined to produce an overall model suitable for process monitoring and improvement.

2.2  Mathematical Model of the Process

One way to think of the semiconductor manufacturing process is as a dynamic program with a stage for each processing step. The various testing results, such as wafer yield, die sort yield, bit-fail maps, margin yield, bit retention time, refresh rates, and others are the outputs of the process at the last stage. The data produced at each step adds to the total information available at that stage. Mathematically, the path of a wafer through the production process could be represented as:
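\begin{align*}
n   &= 0, \ldots, N \\
I_0 &= \text{What is known before starting production} \\
X_0 &= \text{The initial state of the wafer} \\
A_0 &= \text{The initial action} \\
I_n &= \{I_{n-1}, a_{n-1}, Y_{n-1}\}; && \text{Information vector} \\
A_n &= S_n(I_n); && \text{Action vector} \\
X_n &= f_n(X_{n-1}, a_{n-1}, d_n); && \text{State vector} \\
Y_n &= M_n(X_n, e_n); && \text{Results vector}
\end{align*}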
where $n$ is the stage or step of the process, $I_n$ is what we know at step $n$, $A_n$ is the action we choose (which is based on some strategy $S_n$ that uses the information we know), $X_n$ is the ``true'' state or condition of the wafer after the step (which is unknowable, and may depend on some disturbance $d_n$), and $Y_n$ is the new data we learn from the process step through a measurement process $M_n$ that includes errors $e_n$. $Y_N$ is the last information we take from a part, and would be the final test data. If you know $Y_N$ for a wafer, you can calculate the die sort yield, the margin yield, and the sales price; profit, however, also depends on costs, which in turn depend on the production history.

If we think of each uppercase $I_n$, $A_n$, $X_n$, $Y_n$ as a row vector of the various parameters in each processing step, the current database system records the $A_n$ vector of machine settings and the $Y_n$ vector of measurements at each processing step. For example, if step $n = 5$ is a visual inspection that is always done the same way, $A_5$ is the constant procedure used for inspection, $Y_5$ is the result of the inspection, and $d_5$ is any change in the wafer due to the inspection (e.g. a mote of dust became stuck to the part). The part changes from an uninspected wafer without dust ($X_4$) to an inspected part with dust ($X_5$), and for the next step we know everything we knew before, plus the fact that the part was inspected and the results of the inspection. For a more complex operation, masking for example, the vectors would be much more complicated: the action would have many more options and machine settings, and the step would produce much more data.

This model captures several important facts. We may not know everything important about the wafer at every stage, we measure some features that may or may not be representative of the state of the part, we can do different things at different stages, and we know more and more about the part as the part steps through the process. The $I_N$ vector holds every item of recorded data about a particular wafer, and may be thousands or millions of attributes wide. This model is very general, but has enough elements to represent several problems of interest to a manufacturer: yield improvement, design of the best recipe, in-line classification and disposition, and identification of new defects.
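The stage-wise structure can be sketched as a short simulation. This is an illustration only: the scalar state, the additive update, and the function names `strategy`, `transition`, `measure`, and `run_wafer` are all hypothetical stand-ins for the unknown $S_n$, $f_n$, and $M_n$, and the point at which a result enters the information vector is simplified to "immediately after the step".

```python
import random
from dataclasses import dataclass, field


@dataclass
class Information:
    """I_n: everything recorded so far (actions taken and results observed)."""
    actions: list = field(default_factory=list)
    results: list = field(default_factory=list)


def strategy(info):
    # S_n (hypothetical): correct half of the last measured deviation.
    return -0.5 * info.results[-1] if info.results else 0.0


def transition(state, action, disturbance):
    # f_n (hypothetical): additive update of the true, unobserved state.
    return state + action + disturbance


def measure(state, error):
    # M_n (hypothetical): noisy measurement of the true state.
    return state + error


def run_wafer(n_steps, seed=0):
    rng = random.Random(seed)
    info = Information()          # I_0: nothing recorded yet
    state = 0.0                   # X_0: initial state of the wafer
    for _ in range(n_steps):
        action = strategy(info)                                  # action from what is known
        state = transition(state, action, rng.gauss(0.0, 0.1))   # new true state
        result = measure(state, rng.gauss(0.0, 0.05))            # what we observe
        info.actions.append(action)                              # information only grows
        info.results.append(result)
    return info


info = run_wafer(5)
print(len(info.actions), len(info.results))
```

Only `info` is available to an analyst; `state` never leaves the loop, which mirrors the statement that the true state of the part is unknowable.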

Each of these alternative manufacturing problems depends on estimating $Y_N$ from different $I_{0,\ldots,N}$. This mathematical model is general enough to capture all of the elements of these problems, but it might also be intractable due to a number of problems:

Although the mathematics of a dynamic program provides a good structure for describing a complex semiconductor manufacturing process, estimation of the parameters and solution of a single dynamic program is not feasible due to these problems. Each of these problems might be soluble through the use of smaller, more manageable models focusing on segments of the process.

2.3  Semiconductor Manufacturing Data Collection and Storage

In a typical manufacturing plant, the semiconductor manufacturing data is recorded automatically from a number of machines in several distinct manufacturing areas. Assuming the process operates on uniform discrete lots, typical production consists of about 25 lot starts/day flowing through about 270 operations over about 90 days. Using these estimates, one can see that 2250 lots of work in process accrue 6750 operations per day. However, this simplifying assumption of discrete lots is invalid due to batching and detailed processing; for example, several lots are batched together for annealing in a furnace, while each wafer in a lot may have multiple die-shot reticle exposures. These aggregations and disaggregations complicate the understanding of the manufacturing process and produce challenges in data collection. Short runs, rework, and process changes provide another source of complexity in semiconductor manufacturing, as shown in Figure 2. These histograms of the process length assigned to different production lots show that there is no consistent manufacturing process; the lower of the two histograms shows the number of check-in/check-out operations spread from 450 to 550, with no single process length accounting for more than 12 lots. In terms of the mathematical model presented in Section 2.2, the length of the dynamic program is constantly changing, and identification of a consistent information matrix is impossible. Frequent process changes and manufacturing dispositions result in an amorphous process.
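The work-in-process estimate follows from a steady-state assumption (lots leave at the rate they start). A short check of the arithmetic, using only the figures quoted above and hypothetical variable names:

```python
lot_starts_per_day = 25
operations_per_lot = 270
cycle_time_days = 90

# Steady state: lots in process = start rate x time spent in the line.
wip_lots = lot_starts_per_day * cycle_time_days

# Each lot averages 270 / 90 = 3 operations per day while in process.
ops_per_lot_per_day = operations_per_lot / cycle_time_days
ops_per_day = wip_lots * ops_per_lot_per_day

print(wip_lots, ops_per_day)   # 2250 6750.0
```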

Figure 2: Process Length Variance. The count of the number of steps used to produce a lot of wafers shows significant variability in production process length. In the lower figure, notice the mode at approximately 475 process operations. In the upper histogram, note that there are large numbers of very short processes.

To expedite the collection of data, each machine operation is recorded in a transactional database whose structure mirrors the physical production and testing machinery. This optimizes data collection, but hinders data analysis. Each machine is capable of emitting a number of different measurement records, and tables corresponding to each machine and each record type are automatically updated as production flows through the machines. While this data recording was initially driven by contractual agreements with the parent companies of the manufacturing plant, current efforts seek to use this data to improve the production process. The transactional database holds all the required data, but since the results are not aligned with the batch, lot, wafer, reticle shot, or chip, using the data for analysis is not possible without intricate database queries.

High dimensional data often requires reduction of the number of dimensions in order to build knowledge. Examples of data domains similar to semiconductor manufacturing data include high dimensional data from image analysis, radar and spectral data, text recognition and mining, speech recognition, genetic code sequences, and chemometrics. Interpretation of high dimensional data is difficult, as is understanding of a high dimensional model. Several of these fields have underlying spatial or theoretical models on which to base further analysis. If the theory is lacking, however, then the use of data mining techniques to build models may help to develop theory about these complex domains. Semiconductor manufacturing differs from these domains in that the structure of complex manufacturing data is not well organized for analysis.

Examination of the semiconductor manufacturing process leads one to the question: What are appropriate methods for using large dimension, small sample manufacturing data to predict and understand complex manufacturing processes?

3  Literature Review

The literature supporting this research consists of several areas: modeling semiconductor manufacturing, sample size and model complexity, domains of similar complexity, and hierarchical modeling. Semiconductor manufacturing literature consists of high level yield models and low level device performance models. Similar complexity domains exist in image analysis, bioinformatics, and chemometrics.

3.1  Semiconductor modeling

Yield modeling is a key goal of semiconductor manufacturers. Improvements in yield prediction and yield provide direct financial gain to manufacturers. Yield models depend primarily on the combination of defect rates from the various sub-processes to predict yield rates. Unfortunately, the defect rate information attributed to specific sub-processes is expensive and slow to collect, so manufacturers are interested in methods of using the process and test data to predict physical characteristics and, further, defect rates.

provides a good overview of the entire semiconductor manufacturing process, while Horton (1998) [20] further explains a number of yield modeling techniques and formulas. predicts yield based on defect density information drawn from multiple-layer inspection information using in-line pattern defect density information. model the effect of defects on lower layers in a semiconductor sandwich on upper layers.

use a sampling plan across a wafer and lot to produce defect density distributions, which can help to model yield. The paper seems most applicable to predicting yield of one program based on the defect maps of another program, (e.g. 128MiB DRAM based on 64MiB DRAM).

show a number of defect characterization statistics in order to more fully understand low yield. The methods of quadrat statistics (defect per die), spatial point pattern statistics for spatial randomness, and spatial clustering, and collinearity identification are discussed. A fuller explanation of spatial clustering monitoring by provides guidance in creating test statistics from a pick-map and using them to monitor wafers for spatial randomness. explain a method for monitoring large area defects by separating a smoothed cluster from the underlying spatially random component.

, working with SEMATECH, propose an object-oriented database to provide responsive control during the manufacturing process. This method recognizes the hierarchy of the semiconductor manufacturing process, but requires good models of inter-process interactions (i.e. detailed models of downstream features based on upstream parameters). Given these detailed models, an ``active-database'' could generate novel recipes as the process proceeds.

Richards and Shen (2000) [36] develop a model of the physical characteristics of a semiconductor device based on some electrical test data. This is the inverse of the problem of predicting in-line electrical test results based on process parameters.

surveys a number of modeling methods applied to semiconductor manufacturing data from the probe testing machine in a micromechanical accelerometer production process. The data is in several families (i.e. min, max, ave, std, quartiles, and quartile range) of the several monitored variables. Although this work uses wafer-level semiconductor data, it uses data from only one step in the manufacturing process. They use some ad-hoc clustering to stratify the yields before applying some tools; on the set of low yield wafers, the models did not work well. used PCA based regression to reduce the dimensionality, but the resulting model was poor, and interpretation is difficult. The model complexity is 138 = 23 × 6 parameters based on 1123 observations, which is about a 1:10 ratio. The micromechanical accelerometer in may be well described by 23 features, but a 64MiB memory chip is a much more complicated device with many more elements.

discuss data volume in semiconductor manufacturing as 1,000,000 wafer transactions per day. This is consistent with what we see in other fabrication plants: 25 lots of 25 wafers starting each day and progressing through a 500 step process in 90 days, assuming three transactions per wafer per operation. They use a hybrid neural network with memory (implemented as a k-nearest neighbor vector). The k-NN approach helps the interpretation of the reasoning done by the neural net, and improves on the performance of the plain neural net by adding the outputs of a k-nearest neighbor model to the inputs of the neural net.
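The consistency claim can be checked directly. Under a steady-state assumption the 90 day cycle time cancels out, because the wafers in process (starts × 90 days) each complete 500/90 steps per day. The variable names below are hypothetical:

```python
lots_per_day = 25
wafers_per_lot = 25
process_steps = 500
transactions_per_step = 3

# Steady state: daily work equals one full process per wafer started that day.
wafer_starts_per_day = lots_per_day * wafers_per_lot
transactions_per_day = wafer_starts_per_day * process_steps * transactions_per_step

print(transactions_per_day)   # 937500
```

The result, roughly 0.94 million transactions per day, is of the same order as the quoted 1,000,000.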

These papers have shown that yield modeling in semiconductor manufacture is an area of significant current interest, that attempts are being made to estimate physical characteristics based on easily measured electrical characteristics, and that yield modeling is still not perfectly understood. Opportunities exist for linking the various levels of semiconductor modeling to produce useful models of yield.

3.2  Complexity, Sample Size, and Modeling

Models generally require a sample size $n$ of about 6-10 times the number of parameters $p$ in the model. From this, one might assume that a model with dimensionality $p$ of $n/10$ to $n/6$ is reasonable. This relation between model data and model complexity reflects a limitation on the degrees of freedom and the number of estimable parameters in a model. Some domains, such as semiconductor manufacturing, image analysis, text mining, genomic data, and speech analysis, have small numbers of observations compared to the variables present. Managing the complexity of these domains requires methods which exploit the structure, theory, and knowledge of these fields. A method which exploits the hierarchy inherent in a manufacturing process can provide a tool to manage the complexity of the manufacturing data.

An examination of different types of models and the dimensionality of their data provides insight into the use of degrees of freedom in model building. Large data can be large in a number of different ways, depending on the shape, size, and storage requirements of the data. Data with large storage requirements can cause slow processing. Data with large numbers of observations can also impact processing speed. Data with large numbers of dimensions or variables with respect to the number of observations can cause problems with modeling through the estimation of the covariance of a data set.

A simple linear model $Y = bX$, which does not require a covariance matrix estimate but only point estimates of the coefficients, requires a $b$ for each $x$ in the model. Even if the $b$ terms are zero, it requires a degree of freedom to estimate each of them as zero. This simple model requires at least $p$ observations in order to make estimates of the process. A more rigorous linear model additionally estimates standard errors of the model coefficients in order to determine whether the parameters are significant. These two sets of parameters imply that $2 \times p$ degrees of freedom are consumed by a simple modeling process.

Using the simple Hotelling multivariate process monitor, $T^2$, for example, requires estimating $p^2$ covariance and $p$ mean coefficients. The number of degrees of freedom consumed by these estimates exceeds that of linear models because it includes covariance terms between each pair of variables.
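A rough count makes the comparison concrete. The sketch below uses hypothetical helper names; it counts $2p$ terms for the linear model and $p$ means plus covariance terms for $T^2$. Because a covariance matrix is symmetric, only $p(p+1)/2$ of its $p^2$ entries are distinct, so both counts are offered.

```python
def linear_model_params(p):
    # One coefficient and one standard error per variable.
    return 2 * p


def hotelling_t2_params(p, unique_only=True):
    # p means plus covariance terms: p*(p+1)/2 distinct entries,
    # or p*p if every entry of the matrix is counted.
    cov_terms = p * (p + 1) // 2 if unique_only else p * p
    return p + cov_terms


n_observations = 221   # lots in the 90 day sample of Figure 1
for p in (10, 100, 21710):
    print(p, linear_model_params(p), hotelling_t2_params(p), n_observations)
```

Even at p = 100 the covariance-based monitor needs thousands of estimates, far more than the 221 available lots.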

The small sample sizes available over a time span of interest provide for only small models relative to the potential dimensionality of the problem. In order to provide useful models with only a limited number of observations, the models should be limited to a complexity smaller than the number of observations. Complexity in this sense is the number of parameters in the model.

A very small model of the manufacturing process might estimate two terms representing an index of a process step and its effect on yield. For example: Process yield is 0.75 plus some factor times the anneal temperature in step 255. Estimating confidence intervals of the intercept and factor would consume four observations, leaving the rest of the observations to estimate the uncertainty in the predicted yield. The problem with models like these is that there are a great number of competing models, and the uncertainty in model parameters nearly guarantees acceptance of invalid models.

More data would be consumed to validate models, and to choose between competing models (Kennedy et al., 1998 [25]). As an extension to the general wisdom of a sample size of 6-10 times the complexity of the dataset, He and Shau (2000) [18] establish bounds on the increase of complexity of a model as the sample size increases. Their limits are based on the types of functions being estimated (i.e. linear and logistic regressions and a spatial median), and the continuity of the functions. Discontinuous functions can support less complexity on the same data, while increasingly large samples can support more complex models, but not at the rate of increase of the sample size (e.g. a sample of 100 points supporting a 10 term model would better satisfy asymptotic assumptions than a sample of 1000 points supporting a 100 term model). Under one reference they cite, a linear model without discontinuities would support only about 3 times as many terms with 10 times as many samples: 31 terms on 1000 samples is similar to 10 terms on 100 samples.
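The 10-to-31 example is reproduced by a square-root growth rule. The exponent of 0.5 below is an assumption chosen only because it matches the two quoted points (10 terms at 100 samples, about 31 at 1000); it is not taken from the cited bounds, and the function name is hypothetical.

```python
def supported_terms(n, n_ref=100, p_ref=10, exponent=0.5):
    # ASSUMPTION: model size grows like n**exponent, anchored at
    # p_ref terms for n_ref samples. exponent=0.5 is illustrative.
    return p_ref * (n / n_ref) ** exponent


for n in (100, 221, 1000):
    print(n, int(supported_terms(n)))
```

Under this assumed rule, the 221 lots of Figure 1 would support a model of roughly 14 terms, against 21710 recorded variables.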

3.3  High Dimensional Modeling and Visualization

Understanding hyperdimensional (p > 3) datasets is difficult; although many automated systems exist for fitting multivariate models, comparing and understanding multidimensional models is challenging. Although model fitness can often be reduced to a single score, the relevance and meaning of the model inputs are often uninterpretable. Visualization tools for high dimensional data are essential for understanding the data and models. survey multivariate visualization software and suggest that quantitative programming environments such as MatLab, Mathematica, and S/S-Plus/R provide powerful, flexible, and reproducible graphics and explorations. Examples of visualization used to confirm and explore multivariate modeling are shown in Wilhelm et al. (2001) [41]. Essentially, visualization of a hyperdimensional dataset is accomplished in one of three ways: selection of small subsets of the original variables, transformation to a small dimensional space, or decomposition of the problem into several less complex sub-problems.

show a test for structure, basically by removing the first and second moments in the data, then studying the multivariate distribution with a chi-squared test. This leads into other methods of high dimensional visualization, such as Sliced Inverse Regression (SIR). SIR bins the output variable, calculates the corresponding means and covariance of the input variables, reduces the dimension of the predictors, and then examines the output variable in the reduced space.
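The SIR steps listed above can be sketched with NumPy. This is a minimal version of the standard slicing procedure, not the implementation of any cited work; the function name and the synthetic single-index data are assumptions. It requires n > p so that the predictor covariance can be inverted, which is exactly the condition that fails for the raw manufacturing data.

```python
import numpy as np


def sliced_inverse_regression(X, y, n_slices=10, n_directions=2):
    """Estimate directions b such that y depends on X mainly through X @ b."""
    n, p = X.shape
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)

    # Whiten the predictors: Z = (X - mean) @ cov^(-1/2). Needs full-rank cov.
    evals, evecs = np.linalg.eigh(cov)
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - mean) @ inv_sqrt

    # Bin the observations by the sorted output variable.
    slices = np.array_split(np.argsort(y), n_slices)

    # Weighted covariance of the per-slice means of the whitened predictors.
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)

    # Leading eigenvectors of M, mapped back to the original predictor scale.
    _, vecs = np.linalg.eigh(M)                  # eigenvalues ascending
    top = vecs[:, ::-1][:, :n_directions]
    directions = inv_sqrt @ top
    return directions / np.linalg.norm(directions, axis=0)


# Synthetic check: y depends on X only through one direction, beta.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
y = (X @ beta) ** 3 + 0.1 * rng.normal(size=2000)

directions = sliced_inverse_regression(X, y, n_slices=10, n_directions=1)
alignment = abs(float(directions[:, 0] @ beta))
print("cosine between estimated and true direction:", round(alignment, 3))
```

Plotting `y` against `X @ directions[:, 0]` then gives the low dimensional view of the response that the method is meant to provide.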

3.4  Reduction of dimensionality

consider algorithms for reducing the dimensionality of an input data space, distinguishing between different transformations of features used for data mining. Feature subset selection chooses a reduced set of the original variables (e.g. stepwise forward selection). Feature extraction produces a new set of variables as a simple function of the original variables (e.g. principal components analysis). Feature construction creates new variables (e.g. power = voltage × amperage).

Several feature extraction methods, such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Factor Analysis, and Partial Least Squares (PLS), can be used to code the original variables in a smaller dimensional space. PCA produces a set of uncorrelated linear combinations of the initial variables ranked by their contributions to the overall variance. Each PCA component includes each of the original variables, encoded in the associated eigenvector. Singular value analysis is a method of characterizing a data matrix of less than full rank with eigenvalues, eigenvectors, and an orthonormal basis matrix (Bildau and Brenner, 1999). Alter (2000) [2] uses SVD to reduce the dimensionality of high dimensional gene data ({n × p} = {14 × 5981}) to a smaller space of `eigengenes'. These feature extraction techniques map the high dimensional data into a different space, then truncate the new space to a smaller dimension.
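For data with n << p, PCA is conveniently computed from the SVD of the centered data matrix, which avoids forming the p × p covariance matrix. The sketch below assumes NumPy and uses synthetic values in a matrix of the same 14 × 5981 shape as the gene example; `pca_scores` is a hypothetical helper name.

```python
import numpy as np


def pca_scores(X, k):
    """Return the first k component scores, loadings, and variance fractions."""
    Xc = X - X.mean(axis=0)                       # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]                     # coordinates in the reduced space
    loadings = Vt[:k]                             # each row weights all p variables
    explained = s[:k] ** 2 / np.sum(s ** 2)       # share of total variance
    return scores, loadings, explained


rng = np.random.default_rng(0)
X = rng.normal(size=(14, 5981))                   # synthetic data, n << p
scores, loadings, explained = pca_scores(X, k=3)
print(scores.shape, loadings.shape)               # (14, 3) (3, 5981)
```

Each row of `loadings` has 5981 entries, one per original variable, which illustrates why the components are compact but hard to interpret.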

In contrast to feature extraction methods, feature selection methods attempt to choose a subset of the initial variables while maintaining the information required to model the process reliably. Feature selection methods choose and exclude variables from an analysis based on some measure of relevance. Methods of variable subset selection include nested model methods such as backwards elimination or forward selection in regression using changes in model $R^2$; decision trees such as C4.5 or CART, which use an information measure to rank variables; and manual methods using advice from domain experts.
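Forward selection by change in $R^2$ can be sketched as a greedy loop. This is an illustration with NumPy on synthetic data, not a recommended procedure for the plant data: the function names, the `min_gain` stopping rule, and the planted variables (columns 7 and 42) are all assumptions.

```python
import numpy as np


def r_squared(X, y, cols):
    """R^2 of an ordinary least-squares fit of y on an intercept and X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def forward_selection(X, y, max_features, min_gain=0.01):
    """Greedily add the variable that raises R^2 most; stop on small gains."""
    selected, best = [], 0.0
    while len(selected) < max_features:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if not candidates:
            break
        gains = {j: r_squared(X, y, selected + [j]) for j in candidates}
        j_best = max(gains, key=gains.get)
        if gains[j_best] - best < min_gain:
            break
        selected.append(j_best)
        best = gains[j_best]
    return selected, best


rng = np.random.default_rng(1)
X = rng.normal(size=(50, 200))                    # n = 50 observations, p = 200
y = 3.0 * X[:, 7] - 2.0 * X[:, 42] + 0.1 * rng.normal(size=50)
selected, r2 = forward_selection(X, y, max_features=2)
print(selected, round(float(r2), 3))
```

With strong planted effects the two true variables are found, but if `max_features` is raised the loop will keep adding noise variables while n < p, since each one still increases $R^2$.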

Bocchieri and Wilpon (1993) [5] discuss the addition and elimination of features in a speech recognition problem. As equipment becomes faster, new features and higher order transformations of the original features become available. The new speech recognition variables can improve the accuracy of speech recognition algorithms, but the computational complexity of the algorithms becomes an issue. Bocchieri and Wilpon (1993) [5] suggest a method for limiting the number of features based on a misclassification distance in each of the dimensions. John et al. (1994) [22] suggest eliminating irrelevant variables using a ``wrapper'' technique based on stepwise selection or elimination of features and applying the data mining technique to each of the subsets. John et al. (1994) [22] also develop a definition of weak relevance based on conditional dependence on a subset of variables. Hall and Holmes (2000) [16] compare several methods of attribute selection and suggest information gain and a correlation based method for high dimensional data. Hall (2000) [15] develops a correlation based feature selection method as a heuristic search of all subsets of features. Wu and Urpani (1999) [42] suggest eliminating the least relevant features rather than selecting the most relevant in order to handle messy data. propose several random search methods for selecting subsets from high dimensional data.

3.5  Alternate high-dimensional domains and approaches

Image analysis, text mining, speech recognition, spectral analysis, and bioinformatics databases are commonly stored in formats tailored to the specifics of the data and the processing algorithms. Large databases of high dimensional data such as images, text, or speech recognition require dimension reduction to produce summaries and indexes suitable for using this data. Much work has been done on the transformation algorithms in these specialized fields, but the software is often single use or proprietary to the specific application. Bioinformatics data consists of high dimensional genetic information on a limited number of samples, with relatively few genes in the genome responsible for a particular phenomenon, possibly similar to a small number of semiconductor processing parameters being responsible for a particular defect. Spectral analysis of chemical mixtures uses high dimensional spectrogram data to estimate low dimensional mixtures of compounds. Although work has been done in many high-dimensional systems, the single-purpose and tailored systems are not directly applicable to semiconductor manufacturing because of the complexity of the data structure.

Feature extraction methods are often used to summarize and index high dimensional databases for similarity searches. survey several systems for image storage and retrieval. Dimensional reduction of color or spectral histograms, and texture signatures derived from Fourier transforms of the images are also discussed.

Each semiconductor chip, wafer, lot, or batch carries a large number of independent process variables and characteristic measurements which may differ with each chip/wafer/lot/batch. Similarly, an image contains a large number of pixels and their associated characteristics, which may differ for each image. For example, digital cameras routinely produce 1.3 megapixel images; reducing one of these to a simple greyscale image of 100 × 200 pixels at 8 bit depth for web presentation produces an array of 20000 variables with 256 levels each. Dimensional reduction is a strategy of creating summary or signature features (or variables) that may give an analyst a better perspective than a pixel-by-pixel representation. For example, an analyst could query the database for images that are `green', or, in manufacturing, create indices by lot number, by process, and by yield. This facilitates similarity signature modeling in that an analyst could request, for example, all of the lots similar in yield to lots X, Y, and Z. In addition, a practical consideration of indexing strategies is that the number of attributes or fields in relational tables is limited. For example, the commercial database program Oracle 7 limits the number of attributes to 256 in a table. use a multiple level filter to manage high dimensional data. They find that color, texture and `eigenface' representations of image data may generate 256, 240, or 400 dimensions, respectively. Compared to the original dimensionality of the image data, the reduction is dramatic, but search through an index with > 20 dimensions essentially degenerates into a sequential search. solve this problem with a system for storing a multidimensional index in a hierarchy based on transformed features.

Frank and Friedman (1993) explain some chemometric tools for prediction, such as ridge regression, partial least squares, and principal components regression, in domains where the number of variables far exceeds the number of observations. For example, spectrum analyses use high dimensional data (p > 1000) to estimate fractions of chemicals in mixtures.
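To illustrate why such tools are needed when p far exceeds n, the sketch below fits a ridge regression to simulated data with n = 20 and p = 1000. The data, the penalty value, and the helper name `ridge` are assumptions chosen for illustration. The estimate uses the algebraically equivalent dual form b = X'(XX' + lam I)^-1 y, which requires only an n x n solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 1000                      # far fewer observations than variables
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                  # assumption: only a few variables matter
y = X @ beta_true + 0.1 * rng.normal(size=n)

# X'X is p x p but has rank at most n, so ordinary least squares is ill-defined.
rank = np.linalg.matrix_rank(X)

def ridge(X, y, lam):
    """Ridge estimate via the dual form b = X'(XX' + lam I)^-1 y."""
    n_obs = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n_obs), y)

b = ridge(X, y, lam=1.0)
print("rank of X:", rank, "of", p, "columns")
print("residual norm:", np.linalg.norm(y - X @ b))
```

The penalty makes the problem well-posed, but the choice of lam is a modeling decision that would normally be made by cross-validation.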

Gene expression data form a domain with small sample sizes and a large number of variables, many of which are irrelevant to the problem of interest. Eisen et al. (1998) describe data with n ≈ 10^2 and p ≈ 10^5, and a method for clustering variables based on a correlation coefficient using average linkage, then displaying them for human interpretation. The general approach with gene arrays is to take samples of the biological manifestation, and then compare gene arrays in order to identify genes that are related to the question of interest. Kamimura et al. (2000) use mean hypothesis testing, checking for statistically significant differences in each variable between classes of outputs, to determine the relevance of genetic information to a problem of interest.
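A minimal sketch of variable screening by mean hypothesis testing follows. It ranks each variable by a two-sample Welch t statistic between two output classes. This is a generic illustration under simulated data, not the specific procedure of the cited work; the sizes, the shift, and the set `relevant` are assumptions.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for a difference in means between two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

random.seed(1)
n_per_class, p = 10, 200             # small n, larger p (toy sizes)
relevant = {3, 17}                   # hypothetical variables that differ by class

def sample(shifted):
    """Draw n_per_class observations; relevant variables are shifted in one class."""
    return [[random.gauss(5.0 if (shifted and j in relevant) else 0.0, 1.0)
             for j in range(p)] for _ in range(n_per_class)]

class_a, class_b = sample(True), sample(False)

scores = []
for j in range(p):
    t = welch_t([row[j] for row in class_a], [row[j] for row in class_b])
    scores.append((abs(t), j))
top = [j for _, j in sorted(scores, reverse=True)[:2]]
print("variables ranked most different between classes:", sorted(top))
```

With many variables tested at once, some irrelevant variables will appear significant by chance, so a multiple-comparison correction would be needed before interpreting individual tests.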

Mining of text databases for relevance and indexing is a high dimensional problem, with authors, sources, lengths, ages, names, and words forming a complex parameter set that causes difficulty in extracting reliable information. A number of feature subset selection methods applied to text mining have been compared, and a method similar in performance to χ² and information gain procedures has been proposed. Text mining has also been treated as a time series through multiple sensors and found tractable using rule discovery techniques. Information gain has been considered an effective feature selection method for high dimensional text mining problems, and clustering has been used to reduce the dimensionality of text and image documents for more effective searches. Liu and Setiono (1998) propose a scalable Las Vegas algorithm for selecting subsets of features for data sets with large numbers of features and large numbers of observations.
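For concreteness, information gain for a discrete feature can be computed as the reduction in class entropy obtained by splitting on that feature. The sketch below ranks three terms in a toy corpus; the corpus, labels, and term names are invented for illustration.

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(feature, labels):
    """Reduction in label entropy from splitting on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy corpus: 1 = term present in document, 0 = absent.
labels = ["sports", "sports", "sports", "finance", "finance", "finance"]
terms = {
    "goal":   [1, 1, 1, 0, 0, 0],   # perfectly separates the classes
    "market": [0, 0, 1, 1, 1, 1],   # mostly separates them
    "the":    [1, 1, 1, 1, 1, 1],   # carries no class information
}
ranked = sorted(terms, key=lambda t: information_gain(terms[t], labels), reverse=True)
print(ranked)
```

Keeping only the top-ranked terms is one simple way to reduce the dimensionality of a document-term representation before modeling.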

4  Approach and Preliminary Results

The problem of using high-dimensional, low-observation complex manufacturing data for prediction and understanding requires a method for managing the complexity. Since the data are not stored in a simple data matrix, but instead in a set of distinct process-related databases, standard regression, feature selection, and feature extraction methods are not easily applied. Process changes produce missing data and make construction of a full rank data matrix impossible. Modeling the semiconductor manufacturing system using a hierarchy congruent to the manufacturing and engineering process may provide a method for managing the complexity of the manufacturing system. By using the ``natural'' divisions in the process and in the data collection and storage system, the entire system can be broken down into subcomponents, examined and modeled separately, and recombined in a hierarchy of interlocking models. The smaller sub-models, which may still have problems of dimensionality, irrelevant data, and missing data, will be more tractable than the overall model.

4.1  Research Plan

My overall plan for developing a methodology for managing complex manufacturing data as found in semiconductor manufacturing includes several stages and tasks:

Figure 3: Hierarchical Model of DS, TEG and QC data. Solid lines represent current linkages and models, dashed lines represent databases, and dotted lines represent future linkages and models.

4.2  Proposed Model

In contrast to looking for the impact of each of the approximately 21000 variables on the outputs of interest, a hierarchical model using the engineering variables of interest could provide a way to manage the problem of dimensionality and lack of degrees of freedom.

Two internal reports produced at the manufacturer summarize production and testing data: the TEG electrical characterization data through WIPNavigator, and the SIView quality control and process data reports. These two reports represent variables of special interest to the process and design engineers. The internal reports present the variables in these datasets as scatter plots against yield and other output variables, encouraging univariate models. Discussions with process and design engineers helped to build the relationships in Figure 3. Combining these variables in a single analysis is a first step in creating a model of yield or other process outputs. Each of these variables could be used as a response variable for subsets of the process data. The resulting hierarchical model could use the process data to predict the engineering variables, and then in turn predict the yield response. Although the global model uses the same data, the structure imposed by the constraints reduces the number of parameter estimates required by a multivariate model, and helps manage the short run and small sample sizes.

Specifically, DS contains 27 lot-level variables. TEG contains 71 measurements, each a vector of lot summary statistics, monitoring Prime Specification Limits (PSLs) thought important by process engineers. QC contains 21710 variables separated into measurement and process data. Quality control and process engineers have selected 31 of the QC measurement variables as key Engineering Specifications (ES), each a vector of several lot summary statistics and wafer samples.
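A rough count shows how the hierarchy reduces the number of parameters. The short sketch below compares a direct linear model from all QC variables to all DS outputs with a chained model through ES and TEG. It treats each quoted count as a number of scalar variables and assumes a hypothetical k = 20 QC variables feeding each ES sub-model; both are simplifying assumptions, since the text describes several of these quantities as vectors of summary statistics.

```python
# Counts quoted in the text; treating each as a number of scalar variables is an assumption.
n_qc, n_es, n_teg, n_ds = 21710, 31, 71, 27

# Direct model: every DS output regressed on every QC variable.
direct = n_qc * n_ds

# Hierarchical model: QC subset -> ES -> TEG -> DS.
k = 20  # hypothetical number of QC variables feeding each ES sub-model
hierarchical = n_es * k + n_es * n_teg + n_teg * n_ds

print("direct coefficients:      ", direct)
print("hierarchical coefficients:", hierarchical)
```

Under these assumptions the chained structure needs on the order of thousands of coefficients rather than hundreds of thousands, which is the sense in which the imposed structure eases the degrees-of-freedom problem.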

Modeling the PSL variables as outputs of lower level processes, such as the ES, and the ES parameters as outputs from QC measurement and test data divides the manufacturing system into ``natural'' subcomponents. Also, modeling the yield performances in the die sort database as functions of the TEG PSLs may aid in assessing the relative benefits of measuring the PSLs.

Figure 3 shows a number of modeling efforts that may succeed in relating the low level production data to the high level yield and efficiency data. Each connecting line in Figure 3 represents a smaller, more manageable model that may provide insight into the wafer production process. For instance, at the bottom of the figure, each of the several dotted lines links some subset of production data to an engineering specification. It may be possible to discover and control the process parameters contributing to the TV nitride measure using partial least squares or other regression techniques to develop lower and intermediate models. By monitoring and controlling the engineering specifications, it should be possible to understand, predict, and control some characteristics in electrical test and die sort.
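The idea of chaining small models can be sketched on simulated data. In the fragment below, 60 process variables are observed on only 40 lots, so a direct regression of yield on all process variables is under-determined; instead, each of 12 hypothetical engineering specifications is modeled from its own block of 5 process variables, and yield is modeled from the specifications. All sizes, the block structure, and the data are assumptions made for illustration and do not describe the manufacturer's data.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_specs, block = 40, 12, 5        # 40 lots, 12 specs, 5 process variables each
p = n_specs * block                  # 60 process variables: more than n observations

X = rng.normal(size=(n, p))                          # simulated process data
W = rng.normal(size=(n_specs, block))                # true process -> spec weights
specs = np.stack([X[:, j*block:(j+1)*block] @ W[j] for j in range(n_specs)], axis=1)
specs += 0.1 * rng.normal(size=specs.shape)
v = rng.normal(size=n_specs)                         # true spec -> yield weights
yield_ = specs @ v + 0.1 * rng.normal(size=n)

def ols(A, b):
    """Least-squares coefficients with an intercept column appended."""
    A1 = np.column_stack([A, np.ones(len(A))])
    return np.linalg.lstsq(A1, b, rcond=None)[0]

def predict(A, coef):
    """Apply coefficients from ols() to new rows of A."""
    return np.column_stack([A, np.ones(len(A))]) @ coef

# Lower level: one small model per engineering specification.
lower = [ols(X[:, j*block:(j+1)*block], specs[:, j]) for j in range(n_specs)]
# Upper level: yield as a function of the engineering specifications.
upper = ols(specs, yield_)

# Chain the levels: process data -> predicted specs -> predicted yield.
spec_hat = np.stack([predict(X[:, j*block:(j+1)*block], lower[j])
                     for j in range(n_specs)], axis=1)
yield_hat = predict(spec_hat, upper)
r2 = 1 - np.sum((yield_ - yield_hat)**2) / np.sum((yield_ - yield_.mean())**2)
print("in-sample R^2 of chained model:", round(r2, 3))
```

Each sub-model here has at most 13 coefficients estimated from 40 observations, so every level is estimable even though the flat model is not. The fit reported is in-sample only; real data would call for held-out validation at each level.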

Extraction of some die sort and engineering test data from a production system has already been accomplished. What remains is extracting production and measurement information, clarifying the hierarchy between elements of these data, building some of the many possible sub-models, and showing that the sub-models can be combined in a hierarchical model.

Referring to Figure 3, I will build a limited number of models at different levels of the process, relating the outputs of lower level processes to higher level variables of interest. Specifically, data will be cleaned and feature-selected from a database table, aligned with higher level output data, and models will be built using methods such as multiple regression, principal components regression and partial least squares. The smaller models may be amenable to linear regression, general linear models or partial least squares, depending on dimensionality and the output variables. The organizational hierarchy of the existing data, shown in Figure 3, allows for testing of the sub-models. Relating the elements of the production hierarchy this way should demonstrate that although a simple multiple regression of the yield output on the process parameters is not possible, a detailed overall model of the production process may be successful.
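As one example of the methods named above, principal components regression can be sketched in a few lines using the singular value decomposition. The function name `pcr_fit`, the number of components, and the simulated low-rank data are assumptions for illustration only.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal components regression: regress y on the first k PC scores of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    U, s, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    scores = U[:, :k] * s[:k]                       # n x k component scores
    # Scores are mutually orthogonal, so least squares reduces to a per-component ratio.
    gamma = (scores.T @ (y - y_mean)) / s[:k]**2
    beta = Vt[:k].T @ gamma                         # coefficients on the original variables
    return beta, y_mean - x_mean @ beta

rng = np.random.default_rng(7)
n, p, k = 25, 300, 3
latent = rng.normal(size=(n, k))                    # assumption: a few hidden process factors
X = latent @ rng.normal(size=(k, p)) + 0.05 * rng.normal(size=(n, p))
y = latent @ np.array([2.0, -1.0, 0.5]) + 0.05 * rng.normal(size=n)

beta, intercept = pcr_fit(X, y, k)
resid = y - (X @ beta + intercept)
print("coefficients estimated:", beta.shape[0], "from", n, "observations")
print("residual std:", resid.std())
```

Only k regression parameters are actually estimated, which is what makes the method usable when n < p. Partial least squares differs in that it chooses components using the response as well as the predictors.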

At this writing, domains of similar complexity have been explored in the literature, data have been extracted, and preliminary models of die sort variables as a function of in-line testing data have been attempted. I have used SAS summary data, visualization tools, and elicited expert knowledge to manually select relevant feature subsets, and have built models of in-line testing data.

Separating the production margin yields into the contributors to margin yield failures allows modeling of the different defect rates separately, and can provide a more detailed explanation of the correlations between electrical test and die sort data.

5  Anticipated Results and Summary

The organization of the data at the semiconductor manufacturer has not supported easy data analysis. The internal tools, SIView and WIPNavigator, produce a large number of graphs and scatter plots, which results in a number of competing ad hoc univariate models of production yield problems based on single variables and processes. By building a few multivariate models and linking them together in a hierarchical model that mirrors the production organization, I hope to provide a mechanism for combining models from the entire process. Our work in this project has inspired the manufacturer to build new data extraction tools with improved query performance, and larger, more relevant data sets are now available. The production and quality control data (QC) are now available in a form which can be aligned with the in-line testing (TEG) and die sort (DS) data.

Multivariate models including the entire process are impractical due to data structure problems. By developing a framework for relating smaller models in a hierarchical manner, a complex manufacturing process can be broken down into more manageable smaller models which can be combined to produce an overall model.

Opportunities exist to build a number of sub-models relating process parameters to key engineering parameters, key engineering parameters to electrical test data, and electrical test data to die sort and yield data. Feature selection and feature extraction methods can help to reduce the dimensionality of the sub-models in cases where the sub-models have rank problems. Combining these models to build a hierarchical model could increase understanding of the semiconductor manufacturing process. Although the combined model could represent the entire process, it is hoped that the sub-models and overall model will enhance understanding of the process as compared to an under-determined overall model and the many parallel univariate models. Some of the sub-models may prove useful and provide avenues for improvements of the process.

The extraction of useful information from the large storage space and high dimensional databases of semiconductor manufacturing may provide substantial benefits in yield modeling. Using a methodology which captures the hierarchical nature of the semiconductor manufacturing process can provide a method for managing short-run, data-poor complex processes. By segmenting the model into discrete units, multiple less-complex models can be created from the available data. Ultimately, the development of a hierarchical model can help manage the complexity of the semiconductor process and relate high-level yield information and electrical parameters to specific manufacturing process parameters. The methodology used to produce and combine these models into a comprehensive model of the semiconductor manufacturing process will be applicable to other complex manufacturing processes, and also to other complex systems with n << p.

References

[Ahonen et al., 2001]
Ahonen, H., Heinonen, O., Klemettinen, M., and Verkamo, A. I. (2001). Applying data mining techniques for descriptive phrase extraction in digital document collections. Not Published.

[Alter, 2000]
Alter, O. (2000). Singular value decomposition for genome-wide expression data processing and modeling. Technical report, Uxxxx. Mathematical and Computational Challenges in Molecular and Cell Biology, Berkeley, California.

[Aslandogan and Yu, 1999]
Aslandogan, Y. A. and Yu, C. T. (1999). Techniques and systems for image and video retrieval. Knowledge and Data Engineering, 11(1):56-63.

[Basilevsky, 1994]
Basilevsky, A. (1994). Statistical Factor Analysis and Related Methods: Theory and Applications. John Wiley and Sons.

[Bocchieri and Wilpon, 1993]
Bocchieri, E. L. and Wilpon, J. G. (1993). Discriminative feature selection for speech recognition. Computer Speech and Language, 7:229-246.

[Box et al., 1978]
Box, G., Hunter, W., and Hunter, J. (1978). Statistics for Experimenters. New York: John Wiley & Sons, Inc.

[Chaudry et al., 1998]
Chaudry, N., Moyne, J., and Rundensteiner, A. (1998). Active controller: Utilizing active databases for implementing multistep control of semiconductor manufacturing. IEEE Transactions on Components, Packaging and Manufacturing Technology-Part C, 21(3).

[Cunningham and MacKinnon, 1998]
Cunningham, S. P. and MacKinnon, S. (1998). Statistical methods for visual defect methodology. IEEE Transactions on Semiconductor Manufacturing, 11(1):48-53.

[Dumais et al., 1998]
Dumais, S. T., Platt, J., Heckerman, D., and Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of ACM-CIKM98, pages 148-155.

[Eisen et al., 1998]
Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. In Proceedings of the National Academy of Sciences, volume 95, pages 14863-14868.

[Forrest and Mastrangelo, 2001]
Forrest, D. and Mastrangelo, C. (2001). Gemini Die Sort and TEG data analysis for Dominion Semiconductor. Technical report, University of Virginia.

[Fowler et al., 2000]
Fowler, J. W., McCarville, D. R., Montgomery, D. C., Rhoads, T. R., Runger, G. C., Skinner, K. R., and Stanley, J. D. (2000). Multivariate statistical methods for modeling and analysis of wafer probe test data. Private Communication.

[Frank and Friedman, 1993]
Frank, I. E. and Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2):109-135.

[Friedman et al., 1997]
Friedman, D. J., Hansen, M. H., Nair, V. N., and James, D. A. (1997). Model-free estimation of defect clustering in integrated circuit fabrication. IEEE Transactions in Integrated Circuit Manufacturing, 10(3):344-359.

[Hall, 2000]
Hall, M. A. (2000). Correlation-based feature selection for discrete and numeric class machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann Publishers.

[Hall and Holmes, 2000]
Hall, M. A. and Holmes, G. (2000). Benchmarking attribute selection data for data mining. Technical report, Department of Computer Science, University of Waikato.

[Hansen et al., 1997]
Hansen, M. H., Nair, V. N., and Friedman, D. J. (1997). Monitoring wafer map data from integrated circuit fabrication processes for spatially clustered defects. Technometrics, 39(3).

[He and Shau, 2000]
He, X. and Shau, Q.-M. (2000). On parameters of increasing dimension. Journal of Multivariate Analysis, 73:120-135.

[Hess and Weiland, 1999]
Hess, C. and Weiland, L. H. (1999). Extraction of wafer-level defect density distributions to improve yield prediction. IEEE Transactions on Semiconductor Manufacturing, 12(2):175-183.

[Horton, 1998]
Horton, D. (1998). Modeling the yield of mixed-technology die. Solid State Technology, pages 109-119.

[Huffer and Park, 2000]
Huffer, F. W. and Park, C. (2000). A test for multivariate structure. Journal of Applied Statistics, 27(5):633-650.

[John et al., 1994]
John, G. H., Kohavi, R., and Pfleger, K. (1994). Irrelevant features and the subset selection problem. In Cohen, W. W. and Hirsh, H., editors, Machine Learning: Proceedings of the Eleventh International Conference, pages 121-129. San Francisco, CA: Morgan Kaufmann Publishers.

[Johnson and Wichern, 1992]
Johnson, R. A. and Wichern, D. W. (1992). Applied Multivariate Statistical Analysis. Englewood Cliffs, NJ: Prentice Hall.

[Kamimura et al., 2000]
Kamimura, R. T., Bicciato, S., Shimizu, H., Alford, J., and Stephanopoulos, G. (2000). Mining of biological data I: Identifying discriminating features via mean hypothesis testing. Metabolic Engineering, 2:218-227.

[Kennedy et al., 1998]
Kennedy, R. L. et al. (1998). Solving Data Mining Problems through Pattern Recognition. Prentice Hall.

[Li et al., 1999]
Li, C., Chang, E. Y., Garcia-Molina, H., and Wiederhold, G. (1999). Clindex: Clustering for similarity queries in high-dimensional spaces.

[Li, 2001]
Li, K. C. (2001). Dimension reduction and visualization. http://www.stat.ucla.edu/~kcli/sir-PHD.pdf.

[Liu and Motoda, 1998]
Liu, H. and Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, Boston.

[Liu and Setiono, 1998]
Liu, H. and Setiono, R. (1998). Some issues on scalable feature selection. Expert Systems with Applications, 15:333-339.

[Liu and Setiono, 2001]
Liu, H. and Setiono, R. (2001). Some issues on scalable feature selection. Applied Intelligence, Kluwer preprint.

[McLeod and Provost, 2001]
McLeod, A. I. and Provost, S. B. (2001). Multivariate data visualization. Preprint.

[Montgomery, 1996]
Montgomery, D. C. (1996). Introduction to Statistical Process Control. New York: John Wiley & Sons, Inc.

[Neter et al., 1996]
Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W. (1996). Applied Linear Statistical Models. Chicago, IL: Irwin, 4th edition.

[Ng and Tam, 1999]
Ng, R. T. and Tam, D. (1999). Multilevel filtering for high dimensional image data: Why and how. IEEE Transactions on Knowledge and Data Engineering, 11(6):916-928.

[Nurani et al., 1998]
Nurani, R. K., Strojwas, A. J., Maly, W. P., Ouyang, C., Shindo, W., Akella, R., McIntyre, M. G., and Derret, J. (1998). In-line yield prediction using patterned wafer inspection information. IEEE Transactions on Semiconductor Manufacturing, 11(1).

[Richards and Shen, 2000]
Richards, W. R. and Shen, M. (2000). Extraction of two-dimensional metal-metal-oxide-semiconductor field effect transistor structural information from electrical characteristics. Journal of Vacuum Science Technology, 18(1):533-539.

[Shin and Park, 2000]
Shin, C. K. and Park, S. C. (2000). A machine learning approach to yield management in semiconductor manufacturing. International Journal of Production Research, 38(17):4261-4271.

[Shindo et al., 1998]
Shindo, W., Nurani, R. K., and Strojwas, A. J. (1998). Effects of defect propagation/growth on in-line defect based yield prediction. IEEE Transactions on Semiconductor Manufacturing, 11(4):546-551.

[Tufte, 1983]
Tufte, E. R. (1983). The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.

[Van Zant, 1997]
Van Zant, P. (1997). Microchip Fabrication: A Practical Guide to Semiconductor Processing. New York: McGraw Hill, 3rd edition.

[Wilhelm et al., 2001]
Wilhelm, A. F. X., Wegman, E. J., and Symanzik, J. (2001). Visual clustering and classification: The Oronsay particle size data revisited. Preprint.

[Wu and Urpani, 1999]
Wu, X. and Urpani, D. (1999). Induction by attribute elimination. IEEE Transactions on Knowledge and Data Engineering, 11(5):805-812.

[Yang and Pedersen, 2001]
Yang, Y. and Pedersen, J. O. (2001). A comparative study on feature selection in text categorization. Not Published.

Contents

1  Introduction
2  Problem Statement
    2.1  Background of Semiconductor Manufacturing
    2.2  Mathematical Model of the Process
    2.3  Semiconductor Manufacturing Data Collection and Storage
3  Literature Review
    3.1  Semiconductor modeling
    3.2  Complexity, Sample Size, and Modeling
    3.3  High Dimensional Modeling and Visualization
    3.4  Reduction of dimensionality
    3.5  Alternate high-dimensional domains and approaches
4  Approach and Preliminary Results
    4.1  Research Plan
    4.2  Proposed Model
5  Anticipated Results and Summary
Document Control:

List of Figures

    1  Simple Model
    2  Process Length Variance
    3  Hierarchical Model

Document Control:   

$Id: proposal.tex,v 1.6 2001/11/06 00:04:16 drf5n Exp$

1 drf5n@virginia.edu