/*-
 * See the file LICENSE for redistribution information.
 *
 * Copyright (c) 2005,2008 Oracle.  All rights reserved.
 *
 * $Id: README,v 1.22 2008/01/08 20:58:23 bostic Exp $
 */

The "comma-separated value" (csv) directory is a suite of three programs:
|
|
|
|
csv_code: write "helper" code on which to build applications,
|
|
csv_load: import csv files into a Berkeley DB database,
|
|
csv_query: query databases created by csv_load.
|
|
|
|
The goal is to allow programmers to easily build applications for using
|
|
csv databases.
|
|
|
|
You can build the three programs, and run a sample application in this
|
|
directory.
|
|
|
|
First, there's the sample.csv file:

    Adams,Bob,01/02/03,green,apple,37
    Carter,Denise Ann,04/05/06,blue,banana,38
    Eidel,Frank,07/08/09,red,cherry,38
    Grabel,Harriet,10/11/12,purple,date,40
    Indals,Jason,01/03/05,pink,orange,32
    Kilt,Laura,07/09/11,yellow,grape,38
    Moreno,Nancy,02/04/06,black,strawberry,38
    Octon,Patrick,08/10/12,magenta,kiwi,15

The fields are:
    Last name,
    First name,
    Birthdate,
    Favorite color,
    Favorite fruit,
    Age

Second, there's a "description" of that csv file in sample.desc:

    version 1 {
        LastName        string
        FirstName       string
        BirthDate
        Color           string          index
        Fruit           string          index
        Age             unsigned_long   index
    }

The DESCRIPTION file maps one-to-one to the fields in the csv file, and
provides a data type for any field the application wants to use.  (If
the application doesn't care about a field, don't specify a data type
and the csv code will ignore it.)  The string "index" specifies there
should be a secondary index based on the field.

The "field" names in the DESCRIPTION file don't have to be the same as
the ones in the csv file (and, as they may not have embedded spaces,
probably won't be).

To build in the sample directory, on POSIX-like systems, type "make".
This first builds the program csv_code, which it then runs with the
file DESCRIPTION as input.  Running csv_code creates two additional
files: csv_local.c and csv_local.h.  Those two files are then used as
part of the build process for two more programs: csv_load and csv_query.

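If you want to run the code generator by hand instead, a roughly
equivalent invocation (an illustration only -- it assumes the
description file is the sample.desc shown above, and uses the default
output file names documented under csv_code's usage below) would be:

    % ./csv_code -f sample.desc -c csv_local.c -h csv_local.h
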
You can now load the csv file into a Berkeley DB database with the
following command:

    % ./csv_load -h TESTDIR < sample.csv

The csv_load command will create a directory and four databases:

    primary     primary database
    Age         secondary index on Age field
    Color       secondary index on Color field
    Fruit       secondary index on Fruit field

You can then query the database:

    % ./csv_query -h TESTDIR
    Query: id=2
    Record: 2:
        LastName: Carter
        FirstName: Denise
        Color: blue
        Fruit: banana
        Age: 38
    Query: color==green
    Record: 1:
        LastName: Adams
        FirstName: Bob
        Color: green
        Fruit: apple
        Age: 37

and so on.

The csv_code process also creates source code modules that support
building your own applications based on this database.  First, there
is the local csv_local.h include file:

    /*
     * DO NOT EDIT: automatically built by csv_code.
     *
     * Record structure.
     */
    typedef struct __DbRecord {
            u_int32_t        recno;         /* Record number */

            /*
             * Management fields
             */
            void            *raw;           /* Memory returned by DB */
            char            *record;        /* Raw record */
            size_t           record_len;    /* Raw record length */

            u_int32_t        field_count;   /* Field count */
            u_int32_t        version;       /* Record version */

            u_int32_t       *offset;        /* Offset table */

            /*
             * Indexed fields
             */
    #define CSV_INDX_LASTNAME       1
            char            *LastName;

    #define CSV_INDX_FIRSTNAME      2
            char            *FirstName;

    #define CSV_INDX_COLOR          4
            char            *Color;

    #define CSV_INDX_FRUIT          5
            char            *Fruit;

    #define CSV_INDX_AGE            6
            u_long           Age;
    } DbRecord;

This defines the DbRecord structure that is the primary object for this
csv file.  As you can see, the interesting fields in the csv file have
mappings in this structure.

Also, there are routines in the DbRecord.c file your application can use
to handle DbRecord structures.  When you retrieve a record from the
database, the DbRecord structure will be filled in based on that record.

Here are the helper routines:

    int
    DbRecord_print(DbRecord *recordp, FILE *fp)
        Display the contents of a DbRecord structure to the specified
        output stream.

    int
    DbRecord_init(const DBT *key, DBT *data, DbRecord *recordp)
        Fill in a DbRecord from a returned database key/data pair.

    int
    DbRecord_read(u_long key, DbRecord *recordp)
        Read the specified record (DbRecord_init will be called
        to fill in the DbRecord).

    int
    DbRecord_discard(DbRecord *recordp)
        Discard the DbRecord structure (must be called after the
        DbRecord_read function), when the application no longer
        needs the returned DbRecord.

    int
    DbRecord_search_field_name(char *field, char *value, OPERATOR op)
        Display the DbRecords where the field (named by field) has
        the specified relationship to the value.  For example:

            DbRecord_search_field_name("Age", "35", GT)

        would search for records with an "Age" field greater than
        35.

    int
    DbRecord_search_field_number(
        u_int32_t fieldno, char *value, OPERATOR op)
        Display the DbRecords where the field (identified by fieldno)
        has the specified relationship to the value.  The field
        number used as an argument comes from the csv_local.h
        file; for example, CSV_INDX_AGE is the field index for
        the "Age" field in this csv file.  For example:

            DbRecord_search_field_number(CSV_INDX_AGE, "35", GT)

        would search for records with an "Age" field greater than
        35.

    Currently, the csv code only supports three types of data:
    strings, unsigned longs and doubles.  Others can easily be
    added.

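As an illustration, an application might read, display and release a
single record as follows.  This is a minimal sketch, not code from the
distribution: it assumes the generated csv_local.h (for the DbRecord
structure) and whichever csv example header declares the DbRecord_*
helpers have been included, and that the database has already been
opened by the usual csv setup code.

    #include <stdio.h>

    #include "csv_local.h"          /* Generated DbRecord structure. */

    int
    dump_record(u_long id)
    {
            DbRecord record;
            int ret;

            /* Read record "id"; DbRecord_init fills in the structure. */
            if ((ret = DbRecord_read(id, &record)) != 0)
                    return (ret);

            /* Display the record on the standard output. */
            (void)DbRecord_print(&record, stdout);

            /* Release any memory the database handed back. */
            return (DbRecord_discard(&record));
    }
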
The usage of the csv_code program is as follows:

    usage: csv_code [-v] [-c source-file] [-f input] [-h header-file]
        -c      output C source code file
        -h      output C header file
        -f      input file
        -v      verbose (defaults to off)

    -c  A file to which to write the C language code.  By default,
        the file "csv_local.c" is used.

    -f  A file to read for a description of the fields in the
        csv file.  By default, csv_code reads from stdin.

    -h  A file to which to write the C language header structures.
        By default, the file "csv_local.h" is used.

    -v  The -v verbose flag outputs potentially useful debugging
        information.

There are two applications built on top of the code produced by
csv_code: csv_load and csv_query.

The usage of the csv_load program is as follows:

    usage: csv_load [-v] [-F format] [-f csv-file] [-h home] [-V version]
        -F      format (currently supports "excel")
        -f      input file
        -h      database environment home directory
        -v      verbose (defaults to off)

    -F  See "Input format" below.

    -f  If an input file is specified using the -f flag, the file
        is read and the records in the file are stored into the
        database.  By default, csv_load reads from stdin.

    -h  If a database environment home directory is specified
        using the -h flag, that directory is used as the
        Berkeley DB directory.  The default for -h is the
        current working directory or the value of the DB_HOME
        environment variable.

    -V  Specify a version number for the input (the default is 1).

    -v  The -v verbose flag outputs potentially useful debugging
        information.  It can be specified twice for additional
        information.

The usage of the csv_query program is as follows:

    usage: csv_query [-v] [-c cmd] [-h home]

    -c  A command to run; otherwise, csv_query will enter
        interactive mode and prompt for user input.

    -h  If a database environment home directory is specified
        using the -h flag, that directory is used as the
        Berkeley DB directory.  The default for -h is the
        current working directory or the value of the DB_HOME
        environment variable.

    -v  The -v verbose flag outputs potentially useful debugging
        information.  It can be specified twice for additional
        information.

The query program currently supports the following commands:

    ?               Display help screen
    exit            Exit program
    fields          Display list of field names
    help            Display help screen
    quit            Exit program
    version         Display database format version
    field[op]value  Display fields by value (=, !=, <, <=, >, >=, ~, !~)

The "field[op]value" command allows you to specify a field and a
|
|
relationship to a value. For example, you could run the query:
|
|
|
|
csv_query -c "price < 5"
|
|
|
|
to list all of the records with a "price" field less than "5".
|
|
|
|
Field names and all string comparisons are case-insensitive.

The operators ~ and !~ do match/no-match based on the IEEE Std 1003.2
(POSIX.2) Basic Regular Expression standard.

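For example, the following (hypothetical) query would list the records
from the sample data whose LastName field matches the basic regular
expression "^[A-G]":

    % ./csv_query -h TESTDIR -c "LastName ~ ^[A-G]"
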
As a special case, every database has the field "Id", which matches the
record number of the primary key.

Input format:
    The input to the csv_load utility is a text file, containing
    lines of comma-separated fields.

    Blank lines are ignored.  All non-blank lines must be comma-separated
    lists of fields.

    By default:
        <nul> (\000) bytes and unprintable characters are stripped,
        input lines are <nl> (\012) separated,
        commas cannot be escaped.

    If "-F excel" is specified:
        <nul> (\000) bytes and unprintable characters are stripped,
        input lines are <cr> (\015) separated,
        <nl> (\012) characters are stripped from the input,
        commas surrounded by double-quote characters (") are not
        treated as field separators.

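    For example, with "-F excel" a line such as the following
    (hypothetical) record is treated as three fields, because the
    first comma appears inside a double-quoted string:

        "Smith, Jr.",John,42
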
Storage format:
    Records in the primary database are stored with a 32-bit unsigned
    record number as the key.

    Key/Data pair 0 is of the format:
        [version]               32-bit unsigned int
        [field count]           32-bit unsigned int
        [raw record]            byte array

        For example:
            [1]
            [5]
            [field1,field2,field3,field4,field5]

    All other Key/Data pairs are of the format:
        [version]               32-bit unsigned int
        [offset to field 1]     32-bit unsigned int
        [offset to field 2]     32-bit unsigned int
        [offset to field 3]     32-bit unsigned int
        ...                     32-bit unsigned int
        [offset to field N]     32-bit unsigned int
        [offset past field N]   32-bit unsigned int
        [raw record]            byte array

        For example:
            [1]
            [0]
            [2]
            [5]
            [9]
            [14]
            [19]
            [a,ab,abc,abcd,abcde]
             012345678901234567890  << byte offsets
             0         1         2

    So, field 3 of the data can be directly accessed by using
    the "offset to field 3", and the length of the field is
    "((offset to field 4) - (offset to field 3)) - 1".

Limits:
    The csv program stores the primary key in a 32-bit unsigned
    value, limiting the number of records in the database to at most
    2^32 - 1.  New records are inserted after the last existing
    record, that is, new records are not inserted into gaps left by
    any deleted records.  This will limit the total number of records
    stored in any database.

Versioning:
    Versioning means that a database supports multiple versions of its
    records.  This is likely to be necessary when dealing with large
    applications and databases, as record fields change over time.

    The csv application suite does not currently support versions,
    although all of the necessary hooks are there.

    The way versioning will work is as follows:

    The XXX.desc file needs to support multiple version layouts.

    The generated C language structure should be a superset of all of
    the interesting fields from all of the version layouts, regardless
    of which versions of the csv records those fields exist in.

    When the csv layer is asked for a record, the record's version
    will provide a lookup into a separate database of field lists.
    That is, there will be another database which has key/data pairs
    where the key is a version number, and the data is the field
    list.  At that point, it's relatively easy to map the fields
    to the structure as is currently done, except that some of the
    fields may not be filled in.

    To determine if a field in the structure is filled in, the
    application has to have an out-of-band value to put in that
    field during DbRecord initialization.  If that's a problem, the
    alternative would be to add an additional field for each listed
    field -- if the additional field is set to 1, the listed field
    has been filled in, otherwise it hasn't.  The csv code will
    support the notion of required fields, so in most cases the
    application won't need to check before simply using the field;
    the check is only necessary when a field isn't required and so
    may or may not be filled in.

TODO:
    Csv databases are not portable between machines of different
    byte orders.  To make them portable, all of the 32-bit unsigned
    int fields currently written into the database should be
    converted to a standard byte order.  This would include the
    version number and field count in the column-map record, and the
    version and field offsets in the other records.

    Add Extended RE string matches.

    Add APIs to replace the reading of a schema file, allowing users to
    fill in a DbRecord structure and do a put on it.  (Hard problem:
    how to flag fields that aren't filled in.)

    Add a second sample file, and write the actual versioning code.