mirror of https://github.com/Qortal/Brooklyn
You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
556 lines
20 KiB
556 lines
20 KiB
.. SPDX-License-Identifier: GPL-2.0 |
|
|
|
======== |
|
ORANGEFS |
|
======== |
|
|
|
OrangeFS is an LGPL userspace scale-out parallel storage system. It is ideal |
|
for large storage problems faced by HPC, BigData, Streaming Video, |
|
Genomics, Bioinformatics. |
|
|
|
Orangefs, originally called PVFS, was first developed in 1993 by |
|
Walt Ligon and Eric Blumer as a parallel file system for Parallel |
|
Virtual Machine (PVM) as part of a NASA grant to study the I/O patterns |
|
of parallel programs. |
|
|
|
Orangefs features include: |
|
|
|
* Distributes file data among multiple file servers |
|
* Supports simultaneous access by multiple clients |
|
* Stores file data and metadata on servers using local file system |
|
and access methods |
|
* Userspace implementation is easy to install and maintain |
|
* Direct MPI support |
|
* Stateless |
|
|
|
|
|
Mailing List Archives |
|
===================== |
|
|
|
http://lists.orangefs.org/pipermail/devel_lists.orangefs.org/ |
|
|
|
|
|
Mailing List Submissions |
|
======================== |
|
|
|
[email protected] |
|
|
|
|
|
Documentation |
|
============= |
|
|
|
http://www.orangefs.org/documentation/ |
|
|
|
Running ORANGEFS On a Single Server |
|
=================================== |
|
|
|
OrangeFS is usually run in large installations with multiple servers and |
|
clients, but a complete filesystem can be run on a single machine for |
|
development and testing. |
|
|
|
On Fedora, install orangefs and orangefs-server:: |
|
|
|
dnf -y install orangefs orangefs-server |
|
|
|
There is an example server configuration file in |
|
/etc/orangefs/orangefs.conf. Change localhost to your hostname if |
|
necessary. |
|
|
|
To generate a filesystem to run xfstests against, see below. |
|
|
|
There is an example client configuration file in /etc/pvfs2tab. It is a |
|
single line. Uncomment it and change the hostname if necessary. This |
|
controls clients which use libpvfs2. This does not control the |
|
pvfs2-client-core. |
|
|
|
Create the filesystem:: |
|
|
|
pvfs2-server -f /etc/orangefs/orangefs.conf |
|
|
|
Start the server:: |
|
|
|
systemctl start orangefs-server |
|
|
|
Test the server:: |
|
|
|
pvfs2-ping -m /pvfsmnt |
|
|
|
Start the client. The module must be compiled in or loaded before this |
|
point:: |
|
|
|
systemctl start orangefs-client |
|
|
|
Mount the filesystem:: |
|
|
|
mount -t pvfs2 tcp://localhost:3334/orangefs /pvfsmnt |
|
|
|
Userspace Filesystem Source |
|
=========================== |
|
|
|
http://www.orangefs.org/download |
|
|
|
Orangefs versions prior to 2.9.3 would not be compatible with the |
|
upstream version of the kernel client. |
|
|
|
|
|
Building ORANGEFS on a Single Server |
|
==================================== |
|
|
|
Where OrangeFS cannot be installed from distribution packages, it may be |
|
built from source. |
|
|
|
You can omit --prefix if you don't care that things are sprinkled around |
|
in /usr/local. As of version 2.9.6, OrangeFS uses Berkeley DB by |
|
default, we will probably be changing the default to LMDB soon. |
|
|
|
:: |
|
|
|
./configure --prefix=/opt/ofs --with-db-backend=lmdb --disable-usrint |
|
|
|
make |
|
|
|
make install |
|
|
|
Create an orangefs config file by running pvfs2-genconfig and |
|
specifying a target config file. Pvfs2-genconfig will prompt you |
|
through. Generally it works fine to take the defaults, but you |
|
should use your server's hostname, rather than "localhost" when |
|
it comes to that question:: |
|
|
|
/opt/ofs/bin/pvfs2-genconfig /etc/pvfs2.conf |
|
|
|
Create an /etc/pvfs2tab file (localhost is fine):: |
|
|
|
echo tcp://localhost:3334/orangefs /pvfsmnt pvfs2 defaults,noauto 0 0 > \ |
|
/etc/pvfs2tab |
|
|
|
Create the mount point you specified in the tab file if needed:: |
|
|
|
mkdir /pvfsmnt |
|
|
|
Bootstrap the server:: |
|
|
|
/opt/ofs/sbin/pvfs2-server -f /etc/pvfs2.conf |
|
|
|
Start the server:: |
|
|
|
/opt/ofs/sbin/pvfs2-server /etc/pvfs2.conf |
|
|
|
Now the server should be running. Pvfs2-ls is a simple |
|
test to verify that the server is running:: |
|
|
|
/opt/ofs/bin/pvfs2-ls /pvfsmnt |
|
|
|
If stuff seems to be working, load the kernel module and |
|
turn on the client core:: |
|
|
|
/opt/ofs/sbin/pvfs2-client -p /opt/ofs/sbin/pvfs2-client-core |
|
|
|
Mount your filesystem:: |
|
|
|
mount -t pvfs2 tcp://`hostname`:3334/orangefs /pvfsmnt |
|
|
|
|
|
Running xfstests |
|
================ |
|
|
|
It is useful to use a scratch filesystem with xfstests. This can be |
|
done with only one server. |
|
|
|
Make a second copy of the FileSystem section in the server configuration |
|
file, which is /etc/orangefs/orangefs.conf. Change the Name to scratch. |
|
Change the ID to something other than the ID of the first FileSystem |
|
section (2 is usually a good choice). |
|
|
|
Then there are two FileSystem sections: orangefs and scratch. |
|
|
|
This change should be made before creating the filesystem. |
|
|
|
:: |
|
|
|
pvfs2-server -f /etc/orangefs/orangefs.conf |
|
|
|
To run xfstests, create /etc/xfsqa.config:: |
|
|
|
TEST_DIR=/orangefs |
|
TEST_DEV=tcp://localhost:3334/orangefs |
|
SCRATCH_MNT=/scratch |
|
SCRATCH_DEV=tcp://localhost:3334/scratch |
|
|
|
Then xfstests can be run:: |
|
|
|
./check -pvfs2 |
|
|
|
|
|
Options |
|
======= |
|
|
|
The following mount options are accepted: |
|
|
|
acl |
|
Allow the use of Access Control Lists on files and directories. |
|
|
|
intr |
|
Some operations between the kernel client and the user space |
|
filesystem can be interruptible, such as changes in debug levels |
|
and the setting of tunable parameters. |
|
|
|
local_lock |
|
Enable posix locking from the perspective of "this" kernel. The |
|
default file_operations lock action is to return ENOSYS. Posix |
|
locking kicks in if the filesystem is mounted with -o local_lock. |
|
Distributed locking is being worked on for the future. |
|
|
|
|
|
Debugging |
|
========= |
|
|
|
If you want the debug (GOSSIP) statements in a particular |
|
source file (inode.c for example) go to syslog:: |
|
|
|
echo inode > /sys/kernel/debug/orangefs/kernel-debug |
|
|
|
No debugging (the default):: |
|
|
|
echo none > /sys/kernel/debug/orangefs/kernel-debug |
|
|
|
Debugging from several source files:: |
|
|
|
echo inode,dir > /sys/kernel/debug/orangefs/kernel-debug |
|
|
|
All debugging:: |
|
|
|
echo all > /sys/kernel/debug/orangefs/kernel-debug |
|
|
|
Get a list of all debugging keywords:: |
|
|
|
cat /sys/kernel/debug/orangefs/debug-help |
|
|
|
|
|
Protocol between Kernel Module and Userspace |
|
============================================ |
|
|
|
Orangefs is a user space filesystem and an associated kernel module. |
|
We'll just refer to the user space part of Orangefs as "userspace" |
|
from here on out. Orangefs descends from PVFS, and userspace code |
|
still uses PVFS for function and variable names. Userspace typedefs |
|
many of the important structures. Function and variable names in |
|
the kernel module have been transitioned to "orangefs", and The Linux |
|
Coding Style avoids typedefs, so kernel module structures that |
|
correspond to userspace structures are not typedefed. |
|
|
|
The kernel module implements a pseudo device that userspace |
|
can read from and write to. Userspace can also manipulate the |
|
kernel module through the pseudo device with ioctl. |
|
|
|
The Bufmap |
|
---------- |
|
|
|
At startup userspace allocates two page-size-aligned (posix_memalign) |
|
mlocked memory buffers, one is used for IO and one is used for readdir |
|
operations. The IO buffer is 41943040 bytes and the readdir buffer is |
|
4194304 bytes. Each buffer contains logical chunks, or partitions, and |
|
a pointer to each buffer is added to its own PVFS_dev_map_desc structure |
|
which also describes its total size, as well as the size and number of |
|
the partitions. |
|
|
|
A pointer to the IO buffer's PVFS_dev_map_desc structure is sent to a |
|
mapping routine in the kernel module with an ioctl. The structure is |
|
copied from user space to kernel space with copy_from_user and is used |
|
to initialize the kernel module's "bufmap" (struct orangefs_bufmap), which |
|
then contains: |
|
|
|
* refcnt |
|
- a reference counter |
|
* desc_size - PVFS2_BUFMAP_DEFAULT_DESC_SIZE (4194304) - the IO buffer's |
|
partition size, which represents the filesystem's block size and |
|
is used for s_blocksize in super blocks. |
|
* desc_count - PVFS2_BUFMAP_DEFAULT_DESC_COUNT (10) - the number of |
|
partitions in the IO buffer. |
|
* desc_shift - log2(desc_size), used for s_blocksize_bits in super blocks. |
|
* total_size - the total size of the IO buffer. |
|
* page_count - the number of 4096 byte pages in the IO buffer. |
|
* page_array - a pointer to ``page_count * (sizeof(struct page*))`` bytes |
|
of kcalloced memory. This memory is used as an array of pointers |
|
to each of the pages in the IO buffer through a call to get_user_pages. |
|
* desc_array - a pointer to ``desc_count * (sizeof(struct orangefs_bufmap_desc))`` |
|
bytes of kcalloced memory. This memory is further intialized: |
|
|
|
user_desc is the kernel's copy of the IO buffer's ORANGEFS_dev_map_desc |
|
structure. user_desc->ptr points to the IO buffer. |
|
|
|
:: |
|
|
|
pages_per_desc = bufmap->desc_size / PAGE_SIZE |
|
offset = 0 |
|
|
|
bufmap->desc_array[0].page_array = &bufmap->page_array[offset] |
|
bufmap->desc_array[0].array_count = pages_per_desc = 1024 |
|
bufmap->desc_array[0].uaddr = (user_desc->ptr) + (0 * 1024 * 4096) |
|
offset += 1024 |
|
. |
|
. |
|
. |
|
bufmap->desc_array[9].page_array = &bufmap->page_array[offset] |
|
bufmap->desc_array[9].array_count = pages_per_desc = 1024 |
|
bufmap->desc_array[9].uaddr = (user_desc->ptr) + |
|
(9 * 1024 * 4096) |
|
offset += 1024 |
|
|
|
* buffer_index_array - a desc_count sized array of ints, used to |
|
indicate which of the IO buffer's partitions are available to use. |
|
* buffer_index_lock - a spinlock to protect buffer_index_array during update. |
|
* readdir_index_array - a five (ORANGEFS_READDIR_DEFAULT_DESC_COUNT) element |
|
int array used to indicate which of the readdir buffer's partitions are |
|
available to use. |
|
* readdir_index_lock - a spinlock to protect readdir_index_array during |
|
update. |
|
|
|
Operations |
|
---------- |
|
|
|
The kernel module builds an "op" (struct orangefs_kernel_op_s) when it |
|
needs to communicate with userspace. Part of the op contains the "upcall" |
|
which expresses the request to userspace. Part of the op eventually |
|
contains the "downcall" which expresses the results of the request. |
|
|
|
The slab allocator is used to keep a cache of op structures handy. |
|
|
|
At init time the kernel module defines and initializes a request list |
|
and an in_progress hash table to keep track of all the ops that are |
|
in flight at any given time. |
|
|
|
Ops are stateful: |
|
|
|
* unknown |
|
- op was just initialized |
|
* waiting |
|
- op is on request_list (upward bound) |
|
* inprogr |
|
- op is in progress (waiting for downcall) |
|
* serviced |
|
- op has matching downcall; ok |
|
* purged |
|
- op has to start a timer since client-core |
|
exited uncleanly before servicing op |
|
* given up |
|
- submitter has given up waiting for it |
|
|
|
When some arbitrary userspace program needs to perform a |
|
filesystem operation on Orangefs (readdir, I/O, create, whatever) |
|
an op structure is initialized and tagged with a distinguishing ID |
|
number. The upcall part of the op is filled out, and the op is |
|
passed to the "service_operation" function. |
|
|
|
Service_operation changes the op's state to "waiting", puts |
|
it on the request list, and signals the Orangefs file_operations.poll |
|
function through a wait queue. Userspace is polling the pseudo-device |
|
and thus becomes aware of the upcall request that needs to be read. |
|
|
|
When the Orangefs file_operations.read function is triggered, the |
|
request list is searched for an op that seems ready-to-process. |
|
The op is removed from the request list. The tag from the op and |
|
the filled-out upcall struct are copy_to_user'ed back to userspace. |
|
|
|
If any of these (and some additional protocol) copy_to_users fail, |
|
the op's state is set to "waiting" and the op is added back to |
|
the request list. Otherwise, the op's state is changed to "in progress", |
|
and the op is hashed on its tag and put onto the end of a list in the |
|
in_progress hash table at the index the tag hashed to. |
|
|
|
When userspace has assembled the response to the upcall, it |
|
writes the response, which includes the distinguishing tag, back to |
|
the pseudo device in a series of io_vecs. This triggers the Orangefs |
|
file_operations.write_iter function to find the op with the associated |
|
tag and remove it from the in_progress hash table. As long as the op's |
|
state is not "canceled" or "given up", its state is set to "serviced". |
|
The file_operations.write_iter function returns to the waiting vfs, |
|
and back to service_operation through wait_for_matching_downcall. |
|
|
|
Service operation returns to its caller with the op's downcall |
|
part (the response to the upcall) filled out. |
|
|
|
The "client-core" is the bridge between the kernel module and |
|
userspace. The client-core is a daemon. The client-core has an |
|
associated watchdog daemon. If the client-core is ever signaled |
|
to die, the watchdog daemon restarts the client-core. Even though |
|
the client-core is restarted "right away", there is a period of |
|
time during such an event that the client-core is dead. A dead client-core |
|
can't be triggered by the Orangefs file_operations.poll function. |
|
Ops that pass through service_operation during a "dead spell" can timeout |
|
on the wait queue and one attempt is made to recycle them. Obviously, |
|
if the client-core stays dead too long, the arbitrary userspace processes |
|
trying to use Orangefs will be negatively affected. Waiting ops |
|
that can't be serviced will be removed from the request list and |
|
have their states set to "given up". In-progress ops that can't |
|
be serviced will be removed from the in_progress hash table and |
|
have their states set to "given up". |
|
|
|
Readdir and I/O ops are atypical with respect to their payloads. |
|
|
|
- readdir ops use the smaller of the two pre-allocated pre-partitioned |
|
memory buffers. The readdir buffer is only available to userspace. |
|
The kernel module obtains an index to a free partition before launching |
|
a readdir op. Userspace deposits the results into the indexed partition |
|
and then writes them to back to the pvfs device. |
|
|
|
- io (read and write) ops use the larger of the two pre-allocated |
|
pre-partitioned memory buffers. The IO buffer is accessible from |
|
both userspace and the kernel module. The kernel module obtains an |
|
index to a free partition before launching an io op. The kernel module |
|
deposits write data into the indexed partition, to be consumed |
|
directly by userspace. Userspace deposits the results of read |
|
requests into the indexed partition, to be consumed directly |
|
by the kernel module. |
|
|
|
Responses to kernel requests are all packaged in pvfs2_downcall_t |
|
structs. Besides a few other members, pvfs2_downcall_t contains a |
|
union of structs, each of which is associated with a particular |
|
response type. |
|
|
|
The several members outside of the union are: |
|
|
|
``int32_t type`` |
|
- type of operation. |
|
``int32_t status`` |
|
- return code for the operation. |
|
``int64_t trailer_size`` |
|
- 0 unless readdir operation. |
|
``char *trailer_buf`` |
|
- initialized to NULL, used during readdir operations. |
|
|
|
The appropriate member inside the union is filled out for any |
|
particular response. |
|
|
|
PVFS2_VFS_OP_FILE_IO |
|
fill a pvfs2_io_response_t |
|
|
|
PVFS2_VFS_OP_LOOKUP |
|
fill a PVFS_object_kref |
|
|
|
PVFS2_VFS_OP_CREATE |
|
fill a PVFS_object_kref |
|
|
|
PVFS2_VFS_OP_SYMLINK |
|
fill a PVFS_object_kref |
|
|
|
PVFS2_VFS_OP_GETATTR |
|
fill in a PVFS_sys_attr_s (tons of stuff the kernel doesn't need) |
|
fill in a string with the link target when the object is a symlink. |
|
|
|
PVFS2_VFS_OP_MKDIR |
|
fill a PVFS_object_kref |
|
|
|
PVFS2_VFS_OP_STATFS |
|
fill a pvfs2_statfs_response_t with useless info <g>. It is hard for |
|
us to know, in a timely fashion, these statistics about our |
|
distributed network filesystem. |
|
|
|
PVFS2_VFS_OP_FS_MOUNT |
|
fill a pvfs2_fs_mount_response_t which is just like a PVFS_object_kref |
|
except its members are in a different order and "__pad1" is replaced |
|
with "id". |
|
|
|
PVFS2_VFS_OP_GETXATTR |
|
fill a pvfs2_getxattr_response_t |
|
|
|
PVFS2_VFS_OP_LISTXATTR |
|
fill a pvfs2_listxattr_response_t |
|
|
|
PVFS2_VFS_OP_PARAM |
|
fill a pvfs2_param_response_t |
|
|
|
PVFS2_VFS_OP_PERF_COUNT |
|
fill a pvfs2_perf_count_response_t |
|
|
|
PVFS2_VFS_OP_FSKEY |
|
file a pvfs2_fs_key_response_t |
|
|
|
PVFS2_VFS_OP_READDIR |
|
jamb everything needed to represent a pvfs2_readdir_response_t into |
|
the readdir buffer descriptor specified in the upcall. |
|
|
|
Userspace uses writev() on /dev/pvfs2-req to pass responses to the requests |
|
made by the kernel side. |
|
|
|
A buffer_list containing: |
|
|
|
- a pointer to the prepared response to the request from the |
|
kernel (struct pvfs2_downcall_t). |
|
- and also, in the case of a readdir request, a pointer to a |
|
buffer containing descriptors for the objects in the target |
|
directory. |
|
|
|
... is sent to the function (PINT_dev_write_list) which performs |
|
the writev. |
|
|
|
PINT_dev_write_list has a local iovec array: struct iovec io_array[10]; |
|
|
|
The first four elements of io_array are initialized like this for all |
|
responses:: |
|
|
|
io_array[0].iov_base = address of local variable "proto_ver" (int32_t) |
|
io_array[0].iov_len = sizeof(int32_t) |
|
|
|
io_array[1].iov_base = address of global variable "pdev_magic" (int32_t) |
|
io_array[1].iov_len = sizeof(int32_t) |
|
|
|
io_array[2].iov_base = address of parameter "tag" (PVFS_id_gen_t) |
|
io_array[2].iov_len = sizeof(int64_t) |
|
|
|
io_array[3].iov_base = address of out_downcall member (pvfs2_downcall_t) |
|
of global variable vfs_request (vfs_request_t) |
|
io_array[3].iov_len = sizeof(pvfs2_downcall_t) |
|
|
|
Readdir responses initialize the fifth element io_array like this:: |
|
|
|
io_array[4].iov_base = contents of member trailer_buf (char *) |
|
from out_downcall member of global variable |
|
vfs_request |
|
io_array[4].iov_len = contents of member trailer_size (PVFS_size) |
|
from out_downcall member of global variable |
|
vfs_request |
|
|
|
Orangefs exploits the dcache in order to avoid sending redundant |
|
requests to userspace. We keep object inode attributes up-to-date with |
|
orangefs_inode_getattr. Orangefs_inode_getattr uses two arguments to |
|
help it decide whether or not to update an inode: "new" and "bypass". |
|
Orangefs keeps private data in an object's inode that includes a short |
|
timeout value, getattr_time, which allows any iteration of |
|
orangefs_inode_getattr to know how long it has been since the inode was |
|
updated. When the object is not new (new == 0) and the bypass flag is not |
|
set (bypass == 0) orangefs_inode_getattr returns without updating the inode |
|
if getattr_time has not timed out. Getattr_time is updated each time the |
|
inode is updated. |
|
|
|
Creation of a new object (file, dir, sym-link) includes the evaluation of |
|
its pathname, resulting in a negative directory entry for the object. |
|
A new inode is allocated and associated with the dentry, turning it from |
|
a negative dentry into a "productive full member of society". Orangefs |
|
obtains the new inode from Linux with new_inode() and associates |
|
the inode with the dentry by sending the pair back to Linux with |
|
d_instantiate(). |
|
|
|
The evaluation of a pathname for an object resolves to its corresponding |
|
dentry. If there is no corresponding dentry, one is created for it in |
|
the dcache. Whenever a dentry is modified or verified Orangefs stores a |
|
short timeout value in the dentry's d_time, and the dentry will be trusted |
|
for that amount of time. Orangefs is a network filesystem, and objects |
|
can potentially change out-of-band with any particular Orangefs kernel module |
|
instance, so trusting a dentry is risky. The alternative to trusting |
|
dentries is to always obtain the needed information from userspace - at |
|
least a trip to the client-core, maybe to the servers. Obtaining information |
|
from a dentry is cheap, obtaining it from userspace is relatively expensive, |
|
hence the motivation to use the dentry when possible. |
|
|
|
The timeout values d_time and getattr_time are jiffy based, and the |
|
code is designed to avoid the jiffy-wrap problem:: |
|
|
|
"In general, if the clock may have wrapped around more than once, there |
|
is no way to tell how much time has elapsed. However, if the times t1 |
|
and t2 are known to be fairly close, we can reliably compute the |
|
difference in a way that takes into account the possibility that the |
|
clock may have wrapped between times." |
|
|
|
from course notes by instructor Andy Wang |
|
|
|
|