Huge Data Sets

ExaStat

ExaStat allows statistical and data mining algorithms to be computed rapidly on data with millions of records (rows) and thousands of variables (columns). It is extremely efficient on a single processor, and it can automatically distribute computations across multiple computers and across multiple processors on each computer.

Its capacity to handle millions of records and thousands of parameters makes it possible to visualize and analyze huge data sets in ways not possible in other programs.

ExaStat provides an easy-to-use and robust platform for implementing efficient, distributed and parallelized statistical and data mining algorithms. These algorithms can be written using a simplified matrix- and object-oriented syntax (a subset of C++).

ExaStat is an ideal environment for interactively developing and testing algorithms. Equally importantly, these algorithms can be deployed as is in a production environment.

ExaStat provides extremely fast access to data from a variety of data sources, including Microsoft SQL Server 2000.

ExaStat runs on Microsoft Windows 2000/XP/2003, and will take advantage of multiple processors. For distributed computations, it can be installed on any set of networked computers, including computers connected over the internet.

ExaStat will also now run on Linux under Wine, which is available at www.winehq.com.

ExaStat is released as a free open source project under the Apache License, version 2.0. The License allows commercial development (see license information). You may download and install both a binary version and a source code version of ExaStat from our Downloads page.

If you have further questions about ExaStat, you can refer to the Q&A section below.


Q&A: Table of Contents
How can ExaStat be used?
How does ExaStat compare with other data analysis programs?
  - With interactive data analysis programs like S-PLUS and Gauss?
  - With classical statistics packages like SAS and SPSS?
  - With data mining packages like SAS Enterprise Miner and SPSS Clementine?
What are the main benefits of using ExaStat?
What are some example benchmarks?
What makes ExaStat so fast?
Is there any limit on the amount of data?
On what operating systems does ExaStat run?
What kind of data can ExaStat use?
What type of computer configuration is required?
What kind of an interface does ExaStat have?
Does ExaStat allow data to be cleaned and transformed?
What kinds of analyses can ExaStat do?
How is ExaStat implemented?

How can ExaStat be used?

ExaStat can be used as a statistics/data analysis/data mining program with a (currently) limited set of extremely high performance capabilities. It is capable of doing crosstabs, regressions, logistic regressions, and sparse boolean computations that are beyond the capabilities of other software, including SAS. It can handle more observations and more independent variables, more quickly. It can be installed on multiple computers, and can be set up to do distributed computations. Using a small set of inexpensive, networked PCs, you can use ExaStat to create a system that can rapidly do sophisticated computations on hundreds of gigabytes of data.

The ExaStat Census Demo illustrates the use of ExaStat, along with R, to interactively analyze and visualize a huge data set. This can be downloaded from our Downloads page.

ExaStat also provides a platform for implementing efficient, distributed, parallelized, and out-of-memory statistical and data mining algorithms that can be deployed as is in production environments. It is particularly well-suited for use by statisticians and scientists who have developed computational algorithms using interpreted environments such as R, S-PLUS, MATLAB, or Gauss, and who do not want to have to go through the sometimes agonizing process of working with a programmer to translate the algorithms into a compilable language for deployment.

ExaStat can be used as a data programming language that can be both interpreted and compiled; it uses a high-level, simplified subset of C++.

ExaStat can also be used as a matrix programming language.

There are four general ways to use ExaStat's functionality:

  • Interactive Mode: ExaStat comes with a built-in interpreter that is particularly useful for interactive computations. This interpreter is based on the open-source C++ interpreter CINT. The Census Demo uses the interactive mode.

  • Text Editor: ExaStat code can be written using any editor and then executed using ExaStat's integrated interpreter. For instance, the free Crimson Editor and emacs can easily be customized to “run” ExaStat code in this way.

  • Visual Studio: Visual Studio can be used as both an editor and a compiler for ExaStat code, and can also be customized to call ExaStat's interpreter. The new free version, Visual C++ 2005 Express (available at http://msdn.microsoft.com/vstudio/express/visualc/), provides an extremely good interface to ExaStat. ExaStat's setup program will customize any of the current versions of Visual Studio (2003, 2005, and 2005 Express) for use with ExaStat. You can create ExaStat projects, and can use toolbar buttons and menu items that give access to the ExaStat interpreter.

  • DLL Linking: The ExaStat DLL (ExaCore.dll) can be linked by any program, such as R, that provides such a capability.


How does ExaStat compare with other data analysis programs?

- With interactive data analysis programs like S-PLUS and Gauss?

ExaStat can be thought of as a next generation version of programs such as S-PLUS and Gauss. It shares the following benefits of such programs:

  • Like them, ExaStat provides an interactive programming environment for doing data analysis and for developing new algorithms.
  • Like them, ExaStat provides a large set of basic functions, so that new algorithms can easily and quickly be built on top of old ones.

However, ExaStat differs from the older generation of interactive data analysis programs in significant ways, including:

  • ExaStat is built from the ground up to be threaded and distributed. Its built-in algorithms are extremely fast and highly scalable, and it provides a platform for easily developing fast and scalable new algorithms.
  • ExaStat uses C++ as its language. There are enormous advantages to this. C++ and its preprocessor make it possible to present a simplified syntax to the typical user of ExaStat, but at the same time the full power and speed of the language are available if needed. There is no need to re-code algorithms in a new language before deployment.

- With classical statistics packages like SAS and SPSS?

These packages contain a large number of canned statistical routines, but they are neither programmable nor very scalable.


- With data mining packages like SAS Enterprise Miner and SPSS Clementine?

These packages provide easy access to standard data mining algorithms and they are somewhat scalable, but they are not programmable. These packages cannot be used to develop and deploy new algorithms, or to modify and extend existing algorithms.


What are the main benefits of using ExaStat?

As a research and development environment for data analysis algorithms, ExaStat can increase productivity and code quality substantially. Non-programmers can quickly learn the simplified subset of C++ that is sufficient for writing fully parallelized (threaded) and distributed algorithms. New code can easily be added to the system, so that code reuse is built-in. New algorithms can be built upon the large set of existing functions and operators, so that the amount of new code, and the amount of new testing, is kept to a minimum. Equally importantly, code developed in ExaStat is deployable into production as is.

As a production data analysis package in a commercial environment, ExaStat enables data to be used more profitably. It makes it possible to do statistical and data mining computations that otherwise cannot be done because of time, hardware, and software limitations. It expands the set of feasible computations, and makes it possible to extract more information from data. This makes it possible to do a better job of:

  • predicting and detecting events, especially “outliers” or other rare events

  • finding and understanding patterns, relationships and classifications

  • determining model specifications

  • estimating model parameters

  • getting better estimates of the variance of predictions and parameters by making it feasible to do statistically valid resampling even for large problems.


What are some example benchmarks?

What follows are some benchmarks giving a general idea of the computational capacity of ExaStat. These benchmarks use a data set with 30 million records and 100 variables per record (about 12 GB of data), and are computed on a single dual-processor computer.  The data set has 50 continuous variables and 50 categorical variables with 50 categories each. This database is stored in ExaStat’s own data format. The times reported are for the entire analysis, including reading the raw data.

12 GB Database, Single Computer

Analysis Type                    Records      Variables/Categories             Total Time for Analysis
Descriptive statistics [1]       30 million   2500 categories                  16 seconds
Regression [2]                   30 million   250 independent variables        80 seconds
Five related regressions [3]     30 million   250 independent variables each   85 seconds
Regression [4]                   30 million   2504 independent variables       58 seconds
50x50x50x50 cube [5]             30 million   6.25 million cells               100 seconds
All above models in 1 pass [6]   30 million   As above                         160 seconds

The benchmarks are reported for a single Dell Precision 530MT Workstation with 2 Xeon processors running at 1680 MHz, 512 MB RAM, and an 80 GB hard drive, running Windows XP.

Splitting these computations across 2 identical computers (one client and one server), each with equally fast access to its own data would reduce the times roughly by one half. Some pre- and post-computation must be done on the client computer, and there is some communication time, but for all of the examples shown these take relatively little time. There is no fixed limit to the number of servers that can be used.

These benchmarks were computed using two processors. Using only one processor can increase the computation time by up to 90%, but more typically the increases are in the range of 40-80%, depending upon a complex mix of factors.

1 The mean, variance, minimum, maximum and number missing of a single continuous variable are computed within each subsample formed by the intersection of all the categories of 2 categorical variables with 50 categories each. This computation is the equivalent, for example, of computing summary statistics for income for 30 million individuals within each of 50 age categories for each of 50 states.

2 One continuous variable is regressed on 4 continuous independent variables plus the 250 “dummy” variables formed from 5 categorical variables with 50 categories each (including a constant term and dropping 1 dummy from each set).

3 Each of the 5 continuous variables in the previous example is regressed on all the others and the 250 dummy variables. The time for 5 regressions (85 seconds) is only a little longer than the time for 1 of them (80 seconds), because ExaStat automatically figures out the minimal number of computations required.

4 One continuous variable is regressed on 4 continuous variables plus the dummy variables denoting the 2500 subsamples formed by the interaction of 2 categorical variables with 50 categories each.

5 A 50x50x50x50 cube is formed by the interaction of 4 categorical variables with 50 categories each. Each cell in the cube contains the count of the number of valid observations (rows) that fall into the corresponding subsample.

6 All of the above models are combined into one analysis that is computed on a single pass through the data. The overall time (160 seconds) is substantially less than the sum of the times (259 seconds) for models 1, 3, 4, and 5 (model 3 already incorporates model 2) because ExaStat automatically minimizes the number of computations required for all models combined, and because reading the data can be done in parallel with processing the data.
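
The cube described in footnote 5 can be pictured as a flat array of counters indexed by the category codes. The following is a minimal sketch in standard C++, not ExaStat's own implementation. The names CellIndex and CountCube are hypothetical, and it assumes that category codes are integers starting at 0 and that a negative code marks a missing value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: map one record's category codes to a position in a
// flat array, treating the codes as the digits of a mixed-radix number.
// Returns -1 if any code is missing (negative) or out of range.
std::int64_t CellIndex(const std::vector<int>& codes,
                       const std::vector<int>& sizes) {
    assert(codes.size() == sizes.size());
    std::int64_t index = 0;
    for (std::size_t d = 0; d < codes.size(); ++d) {
        if (codes[d] < 0 || codes[d] >= sizes[d]) return -1;
        index = index * sizes[d] + codes[d];
    }
    return index;
}

// Count the valid records (rows) that fall into each cell of the cube.
std::vector<std::int64_t> CountCube(const std::vector<std::vector<int>>& records,
                                    const std::vector<int>& sizes) {
    std::int64_t cells = 1;
    for (int s : sizes) cells *= s;
    std::vector<std::int64_t> counts(static_cast<std::size_t>(cells), 0);
    for (const auto& rec : records) {
        const std::int64_t idx = CellIndex(rec, sizes);
        if (idx >= 0) ++counts[static_cast<std::size_t>(idx)];
    }
    return counts;
}
```

With four variables of 50 categories each, the array holds 6.25 million counters, which is about 50 MB when each counter is a 64-bit integer.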


What makes ExaStat so fast?

Among the main reasons for its exceptional speed are:

  • It pre-analyzes the required computations to figure out the minimal number of steps required. Multiple models are analyzed jointly.
  • It minimizes the number of copies of data that are required during a computation.
  • It minimizes the number of data conversions required.
  • It handles categorical data extremely efficiently.
  • It is coded very efficiently; it makes heavy use of C++ templates, which can generate extremely fast code.
  • It uses the Intel® Math Kernel Library, the high performance BLAS & LAPACK library, for matrix operations where appropriate. We use the last free version (version 5.2) that Intel made available. Even greater performance is undoubtedly possible with their current version (http://www.intel.com/cd/software/products/asmo-na/eng/perflib/mkl/index.htm).
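
One way to see why jointly analyzed models can share work is to note that several least squares regressions on the same variables need only one cross-product matrix. The sketch below is written in standard C++ under that assumption and is not ExaStat's actual code. The names CrossProducts, Solve, and RegressFromCrossProducts are hypothetical. A single pass over the data accumulates Z'Z for all variables, and each regression is then solved from sub-blocks of that matrix without touching the data again.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// One pass over the data: accumulate Z'Z, where each row of Z holds every
// variable used by any model (constant, regressors, and responses).
Matrix CrossProducts(const Matrix& rows) {
    assert(!rows.empty());
    const std::size_t k = rows.front().size();
    Matrix zz(k, std::vector<double>(k, 0.0));
    for (const auto& z : rows) {
        assert(z.size() == k);
        for (std::size_t i = 0; i < k; ++i)
            for (std::size_t j = 0; j < k; ++j)
                zz[i][j] += z[i] * z[j];
    }
    return zz;
}

// Solve A x = b by Gaussian elimination with partial pivoting.
// Assumes A is square and non-singular.
std::vector<double> Solve(Matrix a, std::vector<double> b) {
    const std::size_t n = b.size();
    for (std::size_t col = 0; col < n; ++col) {
        std::size_t pivot = col;
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(a[r][col]) > std::fabs(a[pivot][col])) pivot = r;
        std::swap(a[col], a[pivot]);
        std::swap(b[col], b[pivot]);
        for (std::size_t r = col + 1; r < n; ++r) {
            const double f = a[r][col] / a[col][col];
            for (std::size_t c = col; c < n; ++c) a[r][c] -= f * a[col][c];
            b[r] -= f * b[col];
        }
    }
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {
        double s = b[i];
        for (std::size_t c = i + 1; c < n; ++c) s -= a[i][c] * x[c];
        x[i] = s / a[i][i];
    }
    return x;
}

// Regress variable `dep` on the variables listed in `indep`, using only the
// shared cross-product matrix (no further pass through the data).
std::vector<double> RegressFromCrossProducts(const Matrix& zz, std::size_t dep,
                                             const std::vector<std::size_t>& indep) {
    const std::size_t p = indep.size();
    Matrix xtx(p, std::vector<double>(p));
    std::vector<double> xty(p);
    for (std::size_t i = 0; i < p; ++i) {
        xty[i] = zz[indep[i]][dep];
        for (std::size_t j = 0; j < p; ++j) xtx[i][j] = zz[indep[i]][indep[j]];
    }
    return Solve(xtx, xty);
}
```

In this sketch the expensive step, CrossProducts, grows with the number of rows, while each additional regression costs only a small solve whose size depends on the number of variables.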


Is there any limit on the amount of data?

There is no fixed limit on the number of observations (although as currently compiled the limit is about a billion billion rows). The number of variables (columns) in a set of analyses is limited by RAM. A regression with several thousand variables is easily feasible on a workstation with 512 MB of RAM.
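
A rough calculation makes these limits concrete. The sketch below rests on two assumptions that the text above does not state: that the row limit corresponds to a signed 64-bit counter, and that the memory needed for a regression is dominated by a dense k-by-k cross-product matrix of 8-byte doubles. The names kMaxRows, CrossProductBytes, and CrossProductMiB are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Largest row count representable in a signed 64-bit counter
// (assumption: the integer type actually used by ExaStat is not documented here).
constexpr std::int64_t kMaxRows = std::numeric_limits<std::int64_t>::max();

// Bytes needed for a dense k-by-k matrix of 8-byte doubles, such as the
// cross-product matrix X'X of a regression with k independent variables.
std::int64_t CrossProductBytes(std::int64_t k) {
    assert(k >= 0);
    return k * k * static_cast<std::int64_t>(sizeof(double));
}

// The same figure expressed in binary megabytes (MiB).
double CrossProductMiB(std::int64_t k) {
    return static_cast<double>(CrossProductBytes(k)) / (1024.0 * 1024.0);
}
```

Under these assumptions, 2504 variables need about 48 MiB and 5000 variables about 191 MiB, both well within 512 MB, while a 64-bit counter allows roughly 9.2 billion billion rows.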


On what operating systems does ExaStat run?

ExaStat is written in C++, and is designed to be portable. It is currently compiled and tested on Windows.


 

What kind of data can ExaStat use?

ExaStat has a generic programming interface for accessing data from standard database drivers. On Windows, it uses Microsoft’s ADO to provide a reading capability for all common databases.

It can handle all of the standard data types (integer, floating point, string, and so on).

Data can reside in a central database, can be distributed across computers, or can be streaming.

In addition, ExaStat has its own data file format that allows extremely fast read/write access to massive data sets. This format also allows both columns and rows to be added without re-writing the entire data file, and allows both columns and rows to be read selectively.


What type of computer configuration is required?

ExaStat can, of course, be run on a stand-alone computer.

For distributed computing, ExaStat uses a client/server model. The computers can be connected by any type of network, including the internet. The computers do not all have to be identical. There is no fixed limit to the number of servers that can be used.

One computer is designated as the client, and other computers as servers. The same software is used on the client and the servers.

Each server needs to have fast access to the data that it will be processing, but the connection speed between the client and each server is not typically an important factor in the time required for an analysis.

The data is processed in blocks on each server, and beyond a certain point adding RAM to servers does not speed up computation time. However, there can be an advantage in having a multi-processor client with lots of RAM.
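
The division of labor described above can be sketched in standard C++. This is an illustration only, with hypothetical names (PartialResult, ProcessOnServer, CombineOnClient), and it stands in for the network and for file reads with plain in-memory vectors. Each server walks its local data one block at a time and returns a small fixed-size summary, which is why the client/server connection speed matters little and why extra server RAM stops helping once a block fits in memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Small, fixed-size summary that a server would send back to the client.
struct PartialResult {
    std::size_t count = 0;
    double sum = 0.0;
    double min = std::numeric_limits<double>::infinity();
    double max = -std::numeric_limits<double>::infinity();
};

// Hypothetical server-side routine: process the local data one block at a
// time. localData stands in for a file that would be read block by block,
// so working memory depends on blockSize rather than on the data size.
PartialResult ProcessOnServer(const std::vector<double>& localData,
                              std::size_t blockSize) {
    assert(blockSize > 0);
    PartialResult r;
    for (std::size_t start = 0; start < localData.size(); start += blockSize) {
        const std::size_t end = std::min(start + blockSize, localData.size());
        for (std::size_t i = start; i < end; ++i) {
            const double v = localData[i];
            ++r.count;
            r.sum += v;
            r.min = std::min(r.min, v);
            r.max = std::max(r.max, v);
        }
    }
    return r;
}

// Hypothetical client-side routine: combine the summaries from all servers.
PartialResult CombineOnClient(const std::vector<PartialResult>& parts) {
    PartialResult total;
    for (const auto& p : parts) {
        total.count += p.count;
        total.sum += p.sum;
        total.min = std::min(total.min, p.min);
        total.max = std::max(total.max, p.max);
    }
    return total;
}
```

However many rows a server holds, what it returns in this sketch is a handful of numbers, so the client's work is limited to combining one summary per server.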


What kind of an interface does ExaStat have?

It uses a programming interface that is similar to the interfaces provided in interactive data analysis and mathematical packages such as S-PLUS, Gauss, MATLAB, and Ox.

ExaStat uses a simplified matrix- and object-oriented subset of C++ for standard data manipulation. Memory allocation and deallocation are all handled automatically. There is no need to use “malloc” or “new” or pointers. For example:

///////////////////////////////////////////////////
// Generate some data satisfying the linear
// model: y = X*b + e and estimate the
// parameters b using least squares regression
///////////////////////////////////////////////////
int iR = 1000; // number of rows

// initialize a 4x1 matrix of "parameters"
DoubleMatrix b = "2.1, 3.2, 4.3, 5.4";

// Create some "independent" variables by pasting
// (concatenating) a column of ones to the left
// of a matrix of random normal variables
DoubleMatrix X = Ones(iR) & RandomNormal( iR, 3 );

// generate a random Y = X*b plus random noise
DoubleMatrix Y = X*b + RandomNormal( iR );

// brute force OLS estimation of b using text book formula
DoubleMatrix beta = Inv(X.T()*X)*X.T()*Y;

// OLS regression of Y on X using function
RegressionResults rr = Regression( Y, X );
rr.Display(); // display the results

The computations in the example above will by default be parallelized where appropriate.
 
The following example shows sample syntax for using some of the built-in distributed data analysis routines:

///////////////////////////////////////////////////
// Simple example of specifying a set of models
// to be distributed across a set of servers...
///////////////////////////////////////////////////
// Construct an analysis object
Analysis myAnalysis;

// Specify an ExaStat data file as a data source
myAnalysis.AddDataFileSource("MyDataSet");
// Specify a predefined set of servers to use
myAnalysis.AddServerArray("ServerArray1");

// Now specify some analyses...
// Compute descriptive stats for all variables
myAnalysis.AddDescriptiveStats();
// Regress y1 on x1, x2, and the interaction of
// categorical variables c1 and c2. Appropriate
// “dummy” variables are automatically generated.
myAnalysis.AddRegression( "y1 ~ x1 + x2 + c1:c2" );
// Compute a cube defined by the interaction of
// four categorical variables
myAnalysis.AddCube( "c1 : c2 : c3 : c4" );

// Do the distributed, parallelized computations
myAnalysis.Compute();
// Display all of the results
myAnalysis.DisplayResults();
// Write the analysis object to a file
myAnalysis.Write("MyAnalysis");

Code like that shown above can either be interpreted with the integrated C++ interpreter, or it can be compiled and executed.

The full C++ language is available if desired (most, but not all, C++ code can be interpreted).


Does ExaStat allow data to be cleaned and transformed?

Yes, it provides extensive capabilities for manipulating data, and these capabilities can easily be extended using a simplified matrix- and object-oriented syntax. These capabilities are automatically parallelized and distributable.

Data transformations can be performed during the course of an analysis, or they can be done in a separate pass through the data.


What kinds of analyses can ExaStat do?

ExaStat provides a large set of vector, matrix, and cube functions and operators: arithmetic, relational, logical, transcendental, subsetting, decomposition and solution, aggregation, set, data generation, and so on. Each function/operator operates on all relevant data types, and is parallelized where appropriate.

The following general categories of distributed data analysis algorithms are currently implemented:

  • descriptive statistics (means, variances, mins, maxs, and number missing, within and across categories)
  • cubes and crosstabs (categorical counts and sums, frequencies, hypercubes)
  • least squares (regression)
  • logistic regression
  • sparse boolean counts and logistic regression

Weights can be specified for all of these algorithms.
 
These algorithms are distributable and are also updateable. That is, new data or new results may be used to update existing results. This means, for example, that results can be updated continuously from streaming data, and that this month’s results can update the year-to-date results.
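
As an illustration of what "updateable" can mean for descriptive statistics, the sketch below keeps a running count, mean, and sum of squared deviations that can absorb either a single new observation or a whole previously computed summary. It uses the well-known Welford update and pairwise merge formulas in standard C++. It is not taken from ExaStat, the name RunningStats is hypothetical, and treating NaN as the missing-value marker is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Running summary of one variable: count, mean, and sum of squared
// deviations from the mean (m2).
struct RunningStats {
    std::int64_t n = 0;
    double mean = 0.0;
    double m2 = 0.0;

    // Fold in one new observation (Welford's update).
    // Assumption: NaN marks a missing value and is skipped.
    void Add(double x) {
        if (std::isnan(x)) return;
        ++n;
        const double delta = x - mean;
        mean += delta / static_cast<double>(n);
        m2 += delta * (x - mean);
    }

    // Fold in another summary, for example this month's results into the
    // year-to-date results, without revisiting the underlying data.
    void Merge(const RunningStats& other) {
        if (other.n == 0) return;
        if (n == 0) { *this = other; return; }
        const double na = static_cast<double>(n);
        const double nb = static_cast<double>(other.n);
        const double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * na * nb / (na + nb);
        mean += delta * nb / (na + nb);
        n += other.n;
    }

    // Sample variance (n - 1 denominator); needs at least two observations.
    double Variance() const {
        assert(n > 1);
        return m2 / static_cast<double>(n - 1);
    }
};
```

Merging the summaries of two batches gives the same mean and variance as a single pass over all the observations, up to floating point rounding.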

Additional algorithms such as non-linear least squares or artificial neural net models can be added relatively easily.

ExaStat is designed to be a platform for easily writing efficient, distributed and parallelized statistical and data mining algorithms. These algorithms can be written using a simplified matrix- and object-oriented syntax, building upon the existing functionality.


How is ExaStat implemented?

ExaStat has a state-of-the-art design and implementation:

  • it has an object-oriented design which is implemented in portable C++
  • it makes extensive use of C++ templates, which produce extremely fast code; templates also mean that extending an algorithm to handle a new data type may just be a matter of defining the data type and re-compiling
  • it includes an integrated C++ interpreter, allowing ExaStat to be extended using either compiled or interpreted code, and allowing C++ code to be distributed dynamically across computers
  • it uses state-of-the-art numerical algorithms
  • it provides full access to the BLAS and LAPACK and allows access to any other linkable libraries
  • it is designed to handle missing data at a very low level
  • it is designed from the ground up to deal with exceptions
  • it provides a simple-to-use and robust framework for implementing threaded (parallelized) and distributed statistical and data mining algorithms
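
The point about templates can be illustrated with a small generic function in standard C++. This is a sketch, not ExaStat code: ColumnMean and Cents are hypothetical names. The algorithm is written once, and supporting a new data type only requires defining that type with a conversion to double and re-compiling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Generic column mean: works for any element type T that can be converted
// to double with static_cast.
template <typename T>
double ColumnMean(const std::vector<T>& column) {
    assert(!column.empty());
    double sum = 0.0;
    for (const T& v : column) sum += static_cast<double>(v);
    return sum / static_cast<double>(column.size());
}

// A new, hypothetical data type: a price stored as whole cents.
// Defining the conversion to double is all ColumnMean needs.
struct Cents {
    long long value;
    explicit operator double() const {
        return static_cast<double>(value) / 100.0;
    }
};
```

Because the compiler generates a separate specialization of ColumnMean for each element type, the generic version carries no run-time dispatch cost.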



 

Copyright © 2006-2008, ExaMetrix. All Rights Reserved.