How To¶
Use of the library starts with creating a TableReader object.

import tftables
import tensorflow as tf

reader = tftables.open_file(filename="/path/to/h5/file", batch_size=10)

Here the batch size is specified as an argument to the open_file function. The batch_size defines the length (in the outer dimension) of the elements (batches) returned by the reader.
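To make the batch_size semantics concrete, here is a small NumPy sketch (the dataset shape is hypothetical, not part of the library): a batch is simply a slice of length batch_size along the outer dimension.

```python
import numpy as np

# Hypothetical in-memory stand-in for an HDF5 array of
# 1000 rows, where each row is a 64x64 image.
dataset = np.zeros((1000, 64, 64))
batch_size = 10

# A batch is a slice of length batch_size along the outer
# dimension, so each element the reader returns has shape
# (batch_size, 64, 64).
batch = dataset[:batch_size]
print(batch.shape)
```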
Accessing a single array¶
Suppose you only want to read a single array from your HDF5 file. Doing this is quite straightforward. Start by getting a TensorFlow placeholder for your batch from the reader.
array_batch_placeholder = reader.get_batch(
    path = '/h5/path',  # This is the path to your array inside the HDF5 file.
    cyclic = True,      # In cyclic access, when the reader gets to the end of the
                        # array, it will wrap back to the beginning and continue.
    ordered = False     # The reader will not require the rows of the array to be
                        # returned in the same order as on disk.
)
# You can transform the batch however you like now.
# For example, casting it to floats.
array_batch_float = tf.to_float(array_batch_placeholder)

# The data can now be fed into your network
result = my_network(array_batch_float)

with tf.Session() as sess:
    # The feed method provides a generator that returns
    # feed_dict's containing batches from your HDF5 file.
    for i, feed_dict in enumerate(reader.feed()):
        sess.run(result, feed_dict=feed_dict)
        if i >= N:
            break

# Finally, the reader should be closed.
reader.close()
Note that by default, the ordered argument to get_batch is set to True. If you require the rows of the array to be returned in the same order as they are on disk, then you should leave it as ordered = True. However, this may result in a performance penalty. In machine learning, rows of a dataset often represent independent examples, or data points, so their ordering is not important.
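A plain NumPy sketch of why unordered access is safe for independent examples (the row count is illustrative): a permutation changes the order in which rows arrive, but every row still arrives exactly once.

```python
import numpy as np

# Eight hypothetical row indices of a dataset.
rows = np.arange(8)

# Unordered access may hand rows back in a different order...
rng = np.random.default_rng(0)
unordered = rng.permutation(rows)

# ...but every row still appears exactly once, which is all
# that matters when rows are independent examples.
print(sorted(unordered.tolist()))
```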
Accessing a single table¶
When reading from a table, the get_batch method returns a dictionary. The columns of the table form the keys of this dictionary, and the values are TensorFlow placeholders for batches of each column. If one of the columns has a compound datatype, then its corresponding value in the dictionary will itself be a dictionary. In this way, recursive compound datatypes give recursive dictionaries.

For example, if your table just had two columns, named label and data, then you could use:
table_batch = reader.get_batch(
    path = '/path/to/table',
    cyclic = True,
    ordered = False
)

label_batch = table_batch['label']
data_batch = table_batch['data']
If your table was a bit more complicated, with columns named label and value, where the value column has a compound type with fields named image and lidar, then you could use:
table_batch = reader.get_batch(
    path = '/path/to/complex_table',
    cyclic = True,
    ordered = False
)

label_batch = table_batch['label']
value_batch = table_batch['value']

image_batch = value_batch['image']
lidar_batch = value_batch['lidar']
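The nested dictionary mirrors how compound datatypes nest in the file itself. As an illustration, here is a NumPy structured dtype shaped like the complex_table example above (the field shapes are hypothetical): top-level fields correspond to top-level dictionary keys, and fields of a compound column nest one level deeper.

```python
import numpy as np

# A structured dtype mirroring the complex_table example
# (field shapes are illustrative).
row_dtype = np.dtype([
    ('label', np.int32),
    ('value', [('image', np.float32, (4, 4)),
               ('lidar', np.float32, (8,))]),
])
table = np.zeros(5, dtype=row_dtype)

# Top-level fields become top-level dictionary keys...
print(table.dtype.names)
# ...and fields of the compound 'value' column nest one level
# deeper, just like the nested dictionary returned by get_batch.
print(table['value'].dtype.names)
```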
Using a FIFO queue¶
Copying data to the GPU through a feed_dict is notoriously slow in TensorFlow. It is much faster to buffer data in a queue. You are free to manage your own queues, but a helper class is included to make this task easier.
# As before
array_batch_placeholder = reader.get_batch(
    path = '/h5/path',
    cyclic = True,
    ordered = False)
array_batch_float = tf.to_float(array_batch_placeholder)

# Now we create a FIFO Loader
loader = reader.get_fifoloader(
    queue_size = 10,              # The maximum number of elements that the
                                  # internal Tensorflow queue should hold.
    inputs = [array_batch_float], # A list of tensors that will be stored
                                  # in the queue.
    threads = 1                   # The number of threads used to stuff the
                                  # queue. If ordered access to a dataset
                                  # was requested, then only 1 thread
                                  # should be used.
)

# Batches can now be dequeued from the loader for use in your network.
array_batch_cpu = loader.dequeue()
result = my_network(array_batch_cpu)

with tf.Session() as sess:
    # The loader needs to be started with your Tensorflow session.
    loader.start(sess)

    for i in range(N):
        # You can now cleanly evaluate your network without a feed_dict.
        sess.run(result)

    # It also needs to be stopped for clean shutdown.
    loader.stop(sess)

# Finally, the reader should be closed.
reader.close()
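Conceptually, the loader runs background threads that stuff a bounded FIFO queue while your training loop drains it. A plain-Python analogue (the batch values and queue size are illustrative, not part of the library):

```python
import queue
import threading

# Plain-Python analogue of the FIFO loader: a background thread
# fills a bounded queue while the main thread dequeues batches.
q = queue.Queue(maxsize=10)

def producer():
    for batch in range(5):
        q.put(batch)  # blocks when the queue is full
    q.put(None)       # sentinel signalling clean shutdown

t = threading.Thread(target=producer)
t.start()

results = []
while True:
    batch = q.get()
    if batch is None:
        break
    results.append(batch)
t.join()
print(results)
```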
Non-cyclic access¶
If you are classifying a dataset, rather than training a model, then you probably only want to run through the dataset once. This can be done by passing cyclic = False to get_batch. Once finished, the internal TensorFlow queue will throw an instance of the tensorflow.errors.OutOfRangeError exception to signal termination of the loop. This can be caught manually with a try-except block:
with tf.Session() as sess:
    loader.start(sess)

    try:
        # Keep iterating until the exception breaks the loop
        while True:
            sess.run(result)
    # Now silently catch the exception.
    except tf.errors.OutOfRangeError:
        pass

    loader.stop(sess)
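This is the same pattern Python iterators use: a finite source signals exhaustion with an exception, and the loop catches it to terminate cleanly. In this plain-Python sketch, StopIteration plays the role of tf.errors.OutOfRangeError:

```python
# Plain-Python analogue of the non-cyclic loop:
# StopIteration stands in for tf.errors.OutOfRangeError.
batches = iter([1, 2, 3])
seen = []
try:
    # Keep iterating until the exception breaks the loop
    while True:
        seen.append(next(batches))
except StopIteration:
    pass
print(seen)
```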
A slightly more elegant solution is to use a context manager supplied by the loader class:
with tf.Session() as sess:
    loader.start(sess)

    # This context manager suppresses the exception.
    with loader.catch_termination():
        # Keep iterating until the exception breaks the loop
        while True:
            sess.run(result)

    loader.stop(sess)
Start stop context manager¶
In either cyclic or non-cyclic access, we can use a context manager to start and stop the loader class.
with tf.Session() as sess:
    with loader.begin(sess):
        # Loop over your batches here, for example:
        for _ in range(N):
            sess.run(result)
Quick access to a single dataset¶
It is highly recommended that you use a single dataset. This allows you to use unordered access, which is the fastest way of reading data. If you have multiple sources of data, such as labels and images, then you should organise them into a table. This also has performance benefits due to the locality of the data.
When you only have one dataset, the function load_dataset is provided to set up the reader and loader for you. Any preprocessing that needs to be done CPU-side before loading into the queue can be written as a function that generates a TensorFlow graph. This input transformation function is passed to load_dataset as an argument. The input transform function should return a list of tensors that will be stored in the queue. The input transform is required when the dataset is a table, as the dictionary needs to be turned into a list.
# This function preprocesses the batches before they
# are loaded into the internal queue.
# You can cast data, or do one-hot transforms.
# If the dataset is a table, this function is required.
def input_transform(tbl_batch):
    labels = tbl_batch['label']
    data = tbl_batch['data']

    truth = tf.to_float(tf.one_hot(labels, num_labels, 1, 0))
    data_float = tf.to_float(data)

    return truth, data_float
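A quick NumPy sketch of the one-hot step inside a transform like this (num_labels and the label values are illustrative): row i is all zeros except for a one in column labels[i], which is the same encoding tf.one_hot(labels, num_labels, 1, 0) produces.

```python
import numpy as np

# NumPy sketch of the one-hot encoding step
# (num_labels and the label values are illustrative).
num_labels = 3
labels = np.array([0, 2, 1])

# Row i has a single one in column labels[i], matching
# what tf.one_hot(labels, num_labels, 1, 0) produces.
truth = np.eye(num_labels, dtype=np.float32)[labels]
print(truth)
```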
# Open the HDF5 file and create a loader for a dataset.
# The batch_size defines the length (in the outer dimension)
# of the elements (batches) returned by the reader.
# Takes a function as input that pre-processes the data.
loader = tftables.load_dataset(filename='path/to/h5_file.h5',
                               dataset_path='/internal/h5/path',
                               input_transform=input_transform,
                               batch_size=20)

# To get the data, we dequeue it from the loader.
# Tensorflow tensors are returned in the same order as input_transform
truth_batch, data_batch = loader.dequeue()

# The placeholders can then be used in your network
result = my_network(truth_batch, data_batch)

with tf.Session() as sess:
    # This context manager starts and stops the internal threads and
    # processes used to read the data from disk and store it in the queue.
    with loader.begin(sess):
        for _ in range(num_iterations):
            sess.run(result)
When using load_dataset, the reader is automatically closed when the loader is stopped.
Accessing multiple datasets¶
If your HDF5 file has multiple datasets (multiple arrays, tables, or both), then you should write a script to transform it into a file with only a single table. If this isn't possible, then you can access the datasets directly through tftables, but you must do so using ordered access (otherwise the datasets can get out of sync).
# Use get_batch to access the table.
# Both datasets must be accessed in ordered mode.
table_batch_dict = reader.get_batch(
    path = '/internal/h5_path/to/table',
    ordered = True)
col_A_pl, col_B_pl = table_batch_dict['col_A'], table_batch_dict['col_B']

# Now use get_batch again to access an array.
# Both datasets must be accessed in ordered mode.
labels_batch = reader.get_batch('/my_label_array', ordered = True)
truth_batch = tf.one_hot(labels_batch, 2, 1, 0)

# The loader takes a list of tensors to be stored in the queue.
# When accessing in ordered mode, threads should be set to 1.
loader = reader.get_fifoloader(
    queue_size = 10,
    inputs = [truth_batch, col_A_pl, col_B_pl],
    threads = 1)

# Batches are taken out of the queue using a dequeue operation.
# Tensors are returned in the order they were given when creating the loader.
truth_cpu, col_A_cpu, col_B_cpu = loader.dequeue()

# The dequeued data can then be used in your network.
result = my_network(truth_cpu, col_A_cpu, col_B_cpu)

with tf.Session() as sess:
    with loader.begin(sess):
        for _ in range(N):
            sess.run(result)

reader.close()
Ordered access is enabled by default when using get_batch as a safety measure. It is disabled when using load_dataset, as that function restricts access to a single dataset.