/*-
 * See the file LICENSE for redistribution information.
 *
 * Copyright (c) 2005,2008 Oracle.  All rights reserved.
 *
 * $Id: README,v 1.22 2008/01/08 20:58:23 bostic Exp $
 */

The "comma-separated value" (csv) directory is a suite of three programs:
|
|
|
|
csv_code: write "helper" code on which to build applications,
|
|
csv_load: import csv files into a Berkeley DB database,
|
|
csv_query: query databases created by csv_load.
|
|
|
|
The goal is to allow programmers to easily build applications for using
|
|
csv databases.
|
|
|
|
You can build the three programs, and run a sample application in this
|
|
directory.
|
|
|
|
First, there's the sample.csv file:

    Adams,Bob,01/02/03,green,apple,37
    Carter,Denise Ann,04/05/06,blue,banana,38
    Eidel,Frank,07/08/09,red,cherry,38
    Grabel,Harriet,10/11/12,purple,date,40
    Indals,Jason,01/03/05,pink,orange,32
    Kilt,Laura,07/09/11,yellow,grape,38
    Moreno,Nancy,02/04/06,black,strawberry,38
    Octon,Patrick,08/10/12,magenta,kiwi,15

The fields are:
    Last name,
    First name,
    Birthdate,
    Favorite color,
    Favorite fruit,
    Age

Second, there's a "description" of that csv file in sample.desc:

    version 1 {
        LastName        string
        FirstName       string
        BirthDate
        Color           string          index
        Fruit           string          index
        Age             unsigned_long   index
    }

The DESCRIPTION file maps one-to-one to the fields in the csv file, and
provides a data type for any field the application wants to use.  (If
the application doesn't care about a field, don't specify a data type
and the csv code will ignore it.)  The string "index" specifies there
should be a secondary index based on the field.

The "field" names in the DESCRIPTION file don't have to be the same as
the ones in the csv file (and, as they may not have embedded spaces,
probably won't be).

To build in the sample directory, on POSIX-like systems, type "make".
This first builds the program csv_code, which it then runs with the
file DESCRIPTION as input.  Running csv_code creates two additional
files: csv_local.c and csv_local.h.  Those two files are then used as
part of the build process for two more programs: csv_load and csv_query.

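If you want to run the code generator by hand instead, a roughly
equivalent invocation (an illustration only -- it assumes the
description file is the sample.desc shown above, and uses the default
output file names documented under csv_code's usage below) would be:

    % ./csv_code -f sample.desc -c csv_local.c -h csv_local.h
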
You can now load the csv file into a Berkeley DB database with the
following command:

    % ./csv_load -h TESTDIR < sample.csv

The csv_load command will create a directory and four databases:

    primary     primary database
    Age         secondary index on Age field
    Color       secondary index on Color field
    Fruit       secondary index on Fruit field

You can then query the database:

    % ./csv_query -h TESTDIR
    Query: id=2
    Record: 2:
        LastName: Carter
        FirstName: Denise
        Color: blue
        Fruit: banana
        Age: 38
    Query: color==green
    Record: 1:
        LastName: Adams
        FirstName: Bob
        Color: green
        Fruit: apple
        Age: 37

and so on.

The csv_code process also creates source code modules that support
building your own applications based on this database.  First, there
is the local csv_local.h include file:

    /*
     * DO NOT EDIT: automatically built by csv_code.
     *
     * Record structure.
     */
    typedef struct __DbRecord {
            u_int32_t        recno;         /* Record number */

            /*
             * Management fields
             */
            void            *raw;           /* Memory returned by DB */
            char            *record;        /* Raw record */
            size_t           record_len;    /* Raw record length */

            u_int32_t        field_count;   /* Field count */
            u_int32_t        version;       /* Record version */

            u_int32_t       *offset;        /* Offset table */

            /*
             * Indexed fields
             */
    #define CSV_INDX_LASTNAME       1
            char            *LastName;

    #define CSV_INDX_FIRSTNAME      2
            char            *FirstName;

    #define CSV_INDX_COLOR          4
            char            *Color;

    #define CSV_INDX_FRUIT          5
            char            *Fruit;

    #define CSV_INDX_AGE            6
            u_long           Age;
    } DbRecord;

This defines the DbRecord structure that is the primary object for this
csv file.  As you can see, the interesting fields in the csv file have
mappings in this structure.

Also, there are routines in the DbRecord.c file your application can use
to handle DbRecord structures.  When you retrieve a record from the
database, the DbRecord structure will be filled in based on that record.

Here are the helper routines:

    int
    DbRecord_print(DbRecord *recordp, FILE *fp)
        Display the contents of a DbRecord structure to the specified
        output stream.

    int
    DbRecord_init(const DBT *key, DBT *data, DbRecord *recordp)
        Fill in a DbRecord from a returned database key/data pair.

    int
    DbRecord_read(u_long key, DbRecord *recordp)
        Read the specified record (DbRecord_init will be called
        to fill in the DbRecord).

    int
    DbRecord_discard(DbRecord *recordp)
        Discard the DbRecord structure (must be called after the
        DbRecord_read function), when the application no longer
        needs the returned DbRecord.

    int
    DbRecord_search_field_name(char *field, char *value, OPERATOR op)
        Display the DbRecords where the field (named by field) has
        the specified relationship to the value.  For example:

            DbRecord_search_field_name("Age", "35", GT)

        would search for records with an "Age" field greater than
        35.

    int
    DbRecord_search_field_number(
        u_int32_t fieldno, char *value, OPERATOR op)
        Display the DbRecords where the field (identified by fieldno)
        has the specified relationship to the value.  The field
        number used as an argument comes from the csv_local.h
        file; for example, CSV_INDX_AGE is the field index for
        the "Age" field in this csv file.  For example:

            DbRecord_search_field_number(CSV_INDX_AGE, "35", GT)

        would search for records with an "Age" field greater than
        35.

    Currently, the csv code only supports three types of data:
    strings, unsigned longs and doubles.  Others can easily be
    added.

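As an illustration, an application might read, display and release a
single record as follows.  This is a minimal sketch, not code from the
distribution: it assumes the generated csv_local.h (for the DbRecord
structure) and whichever csv example header declares the DbRecord_*
helpers have been included, and that the database has already been
opened by the usual csv setup code.

    #include <stdio.h>

    #include "csv_local.h"          /* Generated DbRecord structure. */

    int
    dump_record(u_long id)
    {
            DbRecord record;
            int ret;

            /* Read record "id"; DbRecord_init fills in the structure. */
            if ((ret = DbRecord_read(id, &record)) != 0)
                    return (ret);

            /* Display the record on the standard output. */
            (void)DbRecord_print(&record, stdout);

            /* Release any memory the database handed back. */
            return (DbRecord_discard(&record));
    }
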
The usage of the csv_code program is as follows:

    usage: csv_code [-v] [-c source-file] [-f input] [-h header-file]
        -c      output C source code file
        -h      output C header file
        -f      input file
        -v      verbose (defaults to off)

    -c  A file to which to write the C language code.  By default,
        the file "csv_local.c" is used.

    -f  A file to read for a description of the fields in the
        csv file.  By default, csv_code reads from stdin.

    -h  A file to which to write the C language header structures.
        By default, the file "csv_local.h" is used.

    -v  The -v verbose flag outputs potentially useful debugging
        information.

There are two applications built on top of the code produced by
csv_code: csv_load and csv_query.

The usage of the csv_load program is as follows:

    usage: csv_load [-v] [-F format] [-f csv-file] [-h home] [-V version]
        -F      format (currently supports "excel")
        -f      input file
        -h      database environment home directory
        -v      verbose (defaults to off)

    -F  See "Input format" below.

    -f  If an input file is specified using the -f flag, the file
        is read and the records in the file are stored into the
        database.  By default, csv_load reads from stdin.

    -h  If a database environment home directory is specified
        using the -h flag, that directory is used as the
        Berkeley DB directory.  The default for -h is the
        current working directory or the value of the DB_HOME
        environment variable.

    -V  Specify a version number for the input (the default is 1).

    -v  The -v verbose flag outputs potentially useful debugging
        information.  It can be specified twice for additional
        information.

The usage of the csv_query program is as follows:

    usage: csv_query [-v] [-c cmd] [-h home]

    -c  A command to run; otherwise, csv_query will enter
        interactive mode and prompt for user input.

    -h  If a database environment home directory is specified
        using the -h flag, that directory is used as the
        Berkeley DB directory.  The default for -h is the
        current working directory or the value of the DB_HOME
        environment variable.

    -v  The -v verbose flag outputs potentially useful debugging
        information.  It can be specified twice for additional
        information.

The query program currently supports the following commands:

    ?               Display help screen
    exit            Exit program
    fields          Display list of field names
    help            Display help screen
    quit            Exit program
    version         Display database format version
    field[op]value  Display fields by value (=, !=, <, <=, >, >=, ~, !~)

The "field[op]value" command allows you to specify a field and a
|
|
relationship to a value. For example, you could run the query:
|
|
|
|
csv_query -c "price < 5"
|
|
|
|
to list all of the records with a "price" field less than "5".
|
|
|
|
Field names and all string comparisons are case-insensitive.

The operators ~ and !~ do match/no-match based on the IEEE Std 1003.2
(POSIX.2) Basic Regular Expression standard.

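For example, the following (hypothetical) query would list the records
from the sample data whose LastName field matches the basic regular
expression "^[A-G]":

    % ./csv_query -h TESTDIR -c "LastName ~ ^[A-G]"
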
As a special case, every database has the field "Id", which matches the
record number of the primary key.

Input format:
    The input to the csv_load utility is a text file, containing
    lines of comma-separated fields.

    Blank lines are ignored.  All non-blank lines must be comma-separated
    lists of fields.

    By default:
        <nul> (\000) bytes and unprintable characters are stripped,
        input lines are <nl> (\012) separated,
        commas cannot be escaped.

    If "-F excel" is specified:
        <nul> (\000) bytes and unprintable characters are stripped,
        input lines are <cr> (\015) separated,
        <nl> (\012) characters are stripped from the input,
        commas surrounded by double-quote characters (") are not
        treated as field separators.

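    For example, with "-F excel" a line such as the following
    (hypothetical) record is treated as three fields, because the
    first comma appears inside a double-quoted string:

        "Smith, Jr.",John,42
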
Storage format:
    Records in the primary database are stored with a 32-bit unsigned
    record number as the key.

    Key/Data pair 0 is of the format:
        [version]               32-bit unsigned int
        [field count]           32-bit unsigned int
        [raw record]            byte array

        For example:
            [1]
            [5]
            [field1,field2,field3,field4,field5]

    All other Key/Data pairs are of the format:
        [version]               32-bit unsigned int
        [offset to field 1]     32-bit unsigned int
        [offset to field 2]     32-bit unsigned int
        [offset to field 3]     32-bit unsigned int
        ...                     32-bit unsigned int
        [offset to field N]     32-bit unsigned int
        [offset past field N]   32-bit unsigned int
        [raw record]            byte array

        For example:
            [1]
            [0]
            [2]
            [5]
            [9]
            [14]
            [19]
            [a,ab,abc,abcd,abcde]
             012345678901234567890  << byte offsets
             0         1         2

    So, field 3 of the data can be directly accessed by using
    the "offset to field 3", and the length of the field is
    "((offset to field 4) - (offset to field 3)) - 1".

Limits:
    The csv program stores the primary key in a 32-bit unsigned
    value, limiting the number of records in the database to at most
    2^32 - 1.  New records are inserted after the last existing
    record, that is, new records are not inserted into gaps left by
    any deleted records.  This will limit the total number of records
    stored in any database.

Versioning:
    Versioning means that a database supports multiple versions of its
    records.  This is likely to be necessary when dealing with large
    applications and databases, as record fields change over time.

    The csv application suite does not currently support versions,
    although all of the necessary hooks are there.

    The way versioning will work is as follows:

    The XXX.desc file needs to support multiple version layouts.

    The generated C language structure should be a superset of all of
    the interesting fields from all of the version layouts, regardless
    of which versions of the csv records those fields exist in.

    When the csv layer is asked for a record, the record's version
    will provide a lookup into a separate database of field lists.
    That is, there will be another database which has key/data pairs
    where the key is a version number, and the data is the field
    list.  At that point, it's relatively easy to map the fields
    to the structure as is currently done, except that some of the
    fields may not be filled in.

    To determine if a field in the structure is filled in, the
    application has to have an out-of-band value to put in that
    field during DbRecord initialization.  If that's a problem, the
    alternative would be to add an additional field for each listed
    field -- if the additional field is set to 1, the listed field
    has been filled in, otherwise it hasn't.  The csv code will
    support the notion of required fields, so in most cases the
    application won't need to check before simply using the field;
    the check is only necessary when a field isn't required and so
    may or may not be filled in.

TODO:
    Csv databases are not portable between machines of different
    byte orders.  To make them portable, all of the 32-bit unsigned
    int fields currently written into the database should be
    converted to a standard byte order.  This would include the
    version number and field count in the column-map record, and the
    version and field offsets in the other records.

    Add Extended RE string matches.

    Add APIs to replace the reading of a schema file, allowing users to
    fill in a DbRecord structure and do a put on it.  (Hard problem:
    how to flag fields that aren't filled in.)

    Add a second sample file, and write the actual versioning code.