- 1 Overview
- 2 Notes on compatibility with the real cache manager
- 3 NewCC basic operation
- 4 Ok so about NewMM
ReactOS' current cache manager is nice in some ways, but it is overly complicated, differs from the real Windows cache manager in significant ways, and is too tightly tied to section.c in mm.
Notes on compatibility with the real cache manager
Note that the following information was gleaned from studying the behavior of several open source or source-available filesystem drivers that use Cc, as well as careful rereading of the MSDN pages and Rajeev Nagar's classic filesystem internals book.
The Windows cache manager (and newcc) differ from old cc in at least these ways:
- Each BCB as returned from a Cc pin function is a data structure that identifies a cache stripe, and is not synonymous with any whole-file data structures. This must be the case for CcSetBcbOwnerPointer to work.
- The structure pointed to by SharedCacheMap is linked to by the PrivateCacheMap structure in each FileObject and they *must not* be the same for reasons that will be explained.
- There is no need for a reference count on the SharedCacheMap as such. The only question about the lifetime of the SharedCacheMap is whether any PrivateCacheMap points to it. This is why filesystem drivers are allowed to call CcInitializeCacheMap and CcUninitializeCacheMap an unmatched number of times, and it also explains why filesystem drivers are allowed to copy the SharedCacheMap pointer from one file object to another before calling CcInitializeCacheMap. In all cases, the filesystem stores a pointer to the SharedCacheMap in the FCB of the file, copies it to each FileObject that shares the FCB, then calls CcInitializeCacheMap to tell Cc to make a new PrivateCacheMap and point it at the SharedCacheMap. It is safe to call CcUninitializeCacheMap many times because its job is to manage the PrivateCacheMap and to delete the SharedCacheMap only if it deleted the last PrivateCacheMap the file was using. For most filesystems, this works out because the FCB will have the same lifetime as the SharedCacheMap, being deleted when the last reference in a FileObject disappears.
- Stream file objects must be able to silently share the SharedCacheMap from the original.
- The real section system does *not* require caching to be initialized on a file, and avoids two logical races with Cc by using the standard file functions (as well as paging io) to touch the file (one of these is the ability to query file size at startup, the other relates to resizing a section using MmExpandSection).
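The lifetime rule described in the list above (the shared map lives exactly as long as some private map points to it) can be illustrated with a small user-mode model. This is only a sketch: the struct names, fields and the functions init_cache_map and uninit_cache_map are hypothetical stand-ins, not the real Cc structures or the signatures of the real Cc routines.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the shared cache map. */
typedef struct shared_cache_map {
    int private_map_count;   /* private maps currently linked to this shared map */
} shared_cache_map;

/* Hypothetical, simplified stand-in for a file object. */
typedef struct file_object {
    shared_cache_map **shared_slot; /* the shared-map pointer kept in the FCB */
    int has_private_map;            /* nonzero once caching is set up on this object */
} file_object;

/* Models initialization: create the shared map on first use and link one
 * private map per file object. Repeated calls on the same object are no-ops.
 * Returns 0 on success, -1 if allocation fails. */
static int init_cache_map(file_object *fo)
{
    assert(fo != NULL && fo->shared_slot != NULL);
    if (fo->has_private_map)
        return 0;
    if (*fo->shared_slot == NULL) {
        *fo->shared_slot = calloc(1, sizeof(shared_cache_map));
        if (*fo->shared_slot == NULL)
            return -1;
    }
    (*fo->shared_slot)->private_map_count++;
    fo->has_private_map = 1;
    return 0;
}

/* Models uninitialization: drop this object's private map if it has one and
 * delete the shared map only when the last private map goes away.
 * Returns 1 if the shared map was deleted, 0 otherwise. */
static int uninit_cache_map(file_object *fo)
{
    assert(fo != NULL && fo->shared_slot != NULL);
    if (!fo->has_private_map)
        return 0;                 /* extra calls are harmless */
    fo->has_private_map = 0;
    shared_cache_map *shared = *fo->shared_slot;
    if (--shared->private_map_count > 0)
        return 0;
    free(shared);
    *fo->shared_slot = NULL;
    return 1;
}
```

In this model no separate reference count is needed: calling either function more often than its counterpart changes nothing, which mirrors why unmatched calls are tolerated.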
NewCC basic operation
First off, newcc is purely a client of the section system, consuming ROS_SECTION_OBJECTs and mapping them in system space. Each BCB contains a section object pointer that can be dereferenced with ObDereferenceObject, a BaseAddress that can be used with MmUnmapSegmentInSystemSpace if it is mapped, as well as a record of its offset, the associated FILE_OBJECT and a pointer to the NOCC_SHARED_CACHE_MAP. The PrivateCacheMap structure in this scenario is just a member of a linked list that identifies the file object as being a member of a certain NOCC_SHARED_CACHE_MAP. The BCBs themselves are also enlisted in the NOCC_SHARED_CACHE_MAP to make them iterable and identifiable.
Because newcc is a client of Mm, it does not attempt to identify or interfere with page allocation or to use pages that may also be used by Mm, except to assume that mappings created by MmMapInSystemSpaceAtOffset are valid.
There is a fixed-size array of cache stripes that are currently mapped. The size is fixed because kernel address space is limited. Since the section objects on which cc is based don't lose their pages when they're unmapped, we can map and unmap stripes at will, based on the needs of clients. The number of stripes mapped at one time is adjustable, but should never be set to less than 16 (from empirical observation).
The flow of use is generally like this:
A client calls CcInitializeCacheMap and then calls CcPinRead or CcPreparePinWrite a number of times. Cache stripes are reference counted and must be released with CcUnpinData (CcpUnpinData internally), but since we also map and unmap the stripes, we take a reference of our own. Cache stripes cease being owned by client code when the reference count dips below 2. They are freed when the reference count reaches 0 as a result of being evicted from CcCacheSections. When this happens, it is always because a new cache stripe is being allocated by CcpAllocateCacheSections.
When a cache stripe loses its last user reference, it is flushed. This is to keep (at least) our fastfat happy, because it uses both disk I/O and cache I/O to refer to directories and the FAT, and may query the disk directly after a modification made through the cache.
Cache stripes are unmapped when they lose their last reference as above, and might be flushed if the system is shutting down at that time. Even worse, the filesystem may have an exclusive reference to a file when mapping another region of the same, leading to deadlock. For this reason, we launch a work item to finally flush, unmap and destroy the section object for the discarded BCB.
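The reference counting scheme above can be sketched as a small user-mode model. Everything here is an assumption made for illustration: the names stripe, pin_stripe, unpin_stripe and flush_total are hypothetical, the table is searched linearly, and eviction is done inline rather than through a work item as the real code does.

```c
#include <assert.h>

#define STRIPE_COUNT 16   /* the text suggests never going below 16 */

/* Hypothetical model of one slot in the fixed-size stripe table. */
typedef struct stripe {
    int  ref_count;    /* 0 = free, 1 = only the cache's own reference, >= 2 = pinned */
    long file_offset;  /* which part of the file this stripe covers */
} stripe;

static stripe stripes[STRIPE_COUNT];
static int flush_total;   /* counts simulated flushes */

/* Pin the stripe covering file_offset. Reuses a mapped stripe if present,
 * otherwise takes a free slot, otherwise evicts a stripe that no client holds.
 * Returns the slot index, or -1 if every stripe is pinned by a client. */
static int pin_stripe(long file_offset)
{
    int free_slot = -1, idle_slot = -1;
    for (int i = 0; i < STRIPE_COUNT; i++) {
        if (stripes[i].ref_count > 0 && stripes[i].file_offset == file_offset) {
            stripes[i].ref_count++;
            return i;
        }
        if (stripes[i].ref_count == 0 && free_slot < 0)
            free_slot = i;
        if (stripes[i].ref_count == 1 && idle_slot < 0)
            idle_slot = i;
    }
    int slot = free_slot >= 0 ? free_slot : idle_slot;
    if (slot < 0)
        return -1;
    /* Evicting an idle stripe drops the cache's own reference (count 1 -> 0);
     * the slot is then reused with one reference for the cache and one for
     * the caller. */
    stripes[slot].ref_count = 2;
    stripes[slot].file_offset = file_offset;
    return slot;
}

/* Release one client reference. When the last client reference goes away
 * (count falls to 1), the stripe is flushed but stays mapped. */
static void unpin_stripe(int slot)
{
    assert(slot >= 0 && slot < STRIPE_COUNT);
    assert(stripes[slot].ref_count >= 2);
    if (--stripes[slot].ref_count == 1)
        flush_total++;
}
```

The point of the model is the meaning of the counts: a stripe at count 1 is still mapped and cached but is a candidate for eviction, and eviction only ever happens on behalf of a new allocation.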
Ok so about NewMM
Sections have been reimplemented from the ground up without the prior newcc support, fetching data from the underlying FILE_OBJECT as necessary using ordinary PAGING_IO. Doing this allows us to break a dependency loop between mm and cc regarding file size when the memory manager is creating a section for the first time and when the section is being resized. In these cases, we rely on the file system driver to notify the cache manager when the file will be resized, which is consistent with all sources of information about cc.
Wide AVL Page Tables
In order to do this, Mm section segments now only carry an offset when they're part of an image section. In order to make memory use more efficient, and to make 64-bit offsets possible, the page-based, 4-million entry tables that used to be a part of sections have been replaced by AVL tree nodes with 256 entries per node.
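As a rough illustration of how a 64-bit file offset could be split between an AVL node key and a slot inside the node, here is a minimal sketch. It assumes 4 KiB pages, so one 256-entry node covers 1 MiB of the file; the struct layout and the names page_table_node, node_base_offset, node_entry_index and node_slot are hypothetical, not taken from the ReactOS sources.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT       12    /* assumption: 4 KiB pages */
#define ENTRIES_PER_NODE 256
#define NODE_SPAN        ((uint64_t)ENTRIES_PER_NODE << PAGE_SHIFT)  /* 1 MiB per node */

/* Hypothetical node layout: one AVL node holds 256 page entries. */
typedef struct page_table_node {
    uint64_t  base_offset;                /* file offset of entry 0, NODE_SPAN aligned */
    uintptr_t entries[ENTRIES_PER_NODE];  /* one slot per page */
} page_table_node;

/* Key under which the node covering file_offset would be stored in the tree. */
static uint64_t node_base_offset(uint64_t file_offset)
{
    return file_offset & ~(NODE_SPAN - 1);
}

/* Index of the page entry inside that node (bits 12..19 of the offset). */
static unsigned node_entry_index(uint64_t file_offset)
{
    return (unsigned)((file_offset >> PAGE_SHIFT) & (ENTRIES_PER_NODE - 1));
}

/* Slot for file_offset inside a node that is known to cover it. */
static uintptr_t *node_slot(page_table_node *node, uint64_t file_offset)
{
    assert(node->base_offset == node_base_offset(file_offset));
    return &node->entries[node_entry_index(file_offset)];
}
```

With this split, a sparse or very large file only pays for the 1 MiB ranges that actually have pages, and the tree key can be a full 64-bit offset.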
In order to allow pages to be occupied by file data, but not mapped (cache segments are allowed to be unmapped, but still present), we use rmap to remember which pages belong to which sections. Since pages have a reference count of their own, we use the rmap in combination with the page reference count to determine how many owners a page has. Since pages are referenced when mapped, unreferenced when freed, and unreferenced when removed from a segment page table, the code has a natural flow and proper ownership is maintained in a way that matches the consumer's intention. Segment rmaps are exclusively managed by the code in sptab.c, which adds and deletes them depending on a page's use in a segment.
Segment rmaps are identified by holding 0xffffffxx in the 'Address' slot and the address of a page table AVL node in the 'Process' slot. The AVL nodes contain a pointer to their home segment, as well as a LARGE_INTEGER offset. The low 8 bits of the Address slot give the index of the page entry in the AVL node, and therefore bits 12 through 19 of the page's offset in the underlying file, allowing us to tell where in the file the page is mapped and to get back to the same entry by offset. This system allows a segment page table to hold an ordinary reference to a page and still allow us to evict it and keep track of it when it isn't mapped.
PageOPs and deep stacks
Although it isn't strictly necessary, PAGEOPs have been removed in favor of a sentinel value, the 0x7ffffff pagefile reference or MM_WAIT_ENTRY, which means that the slot is being worked on and the other consumer should wait. This is sufficient to maintain order among competing threads, removes hundreds of lines of copied code for handling various contingencies of PAGEOPs, and does not require a lookup in a list. Note that in the checked-in code, PAGEOPs are still used for image sections and pagefile sections, but no longer for data sections. It's likely that I will import more of this work from Bitbucket once NewCC is better settled into the code base. Upon encountering an MM_WAIT_ENTRY entry in a page table, the consumer must eventually wait on MmWaitPageEvent before trying again. Anyone who puts MM_WAIT_ENTRY into a page table *must* remove it before returning control to the user process.
Handling page faults may now require multiple levels of fault handling, since a filesystem may ultimately need to access a cached copy of an offsets block in order to know where to do a paging read to satisfy a fault. Cases like this previously would have run out of stack space. Now, the code has been generalized to launch work items to cover faults when multiple faults are in play simultaneously. This ensures that the stack won't be blown.
reqtools.c, locks and blocking resource acquisition
Flow of control in fault handlers is different from before. Handlers do not receive locked locks and unlock and relock them chaotically during their runs. Each handler function either runs with a certain lock required and does not unlock it, or locks whatever it needs in its body and returns with the same locks unheld. In order to consolidate some code and make it easier for consumers to provide functions that satisfy page faults, a mechanism has been added that allows the provider to pass out a request that might block, and which must be undertaken without holding any resources.
A fault provider requests such a blocking call by filling in the provided MM_REQUIRED_RESOURCES and returning STATUS_MORE_PROCESSING_REQUIRED. Once the request has been fulfilled, the fault provider will be called again. Doing this allows fault providers to work as state machines, using the MM_REQUIRED_RESOURCES struct and the page entry as state that is modified without blocking. When blocking is required, locks are released by the caller and the blocking call is done in a way that ensures that parallel operation is not impeded and that the integrity of passages of code that consume locks is strictly maintained. From a programmer's perspective, this means that religiously following a policy of releasing each lock the programmer takes, and only those, is guaranteed to be appropriate.
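The retry protocol can be sketched as a small user-mode model. It is not the real interface: required_resources stands in for MM_REQUIRED_RESOURCES with invented fields, fault_status stands in for NTSTATUS, and the lock is reduced to a flag so that the "no locks held while blocking" rule can be checked with assertions.

```c
#include <assert.h>

/* Hypothetical stand-ins for the status codes used by the protocol. */
typedef enum { FAULT_DONE, FAULT_MORE_PROCESSING_REQUIRED } fault_status;

/* Hypothetical stand-in for the request block; the fields are invented. */
typedef struct required_resources {
    void (*blocking_request)(struct required_resources *rr); /* run with no locks held */
    int page_ready;       /* state carried between provider calls */
    int provider_calls;   /* for demonstration only */
} required_resources;

static int segment_lock_held;   /* models the lock the caller owns */

/* The operation that may block, e.g. a paging read. Must run unlocked. */
static void read_page_blocking(required_resources *rr)
{
    assert(!segment_lock_held);
    rr->page_ready = 1;
}

/* The fault provider never blocks and never touches the lock: it either
 * finishes or describes the blocking work it needs and asks to be recalled. */
static fault_status fault_provider(required_resources *rr)
{
    assert(segment_lock_held);
    rr->provider_calls++;
    if (!rr->page_ready) {
        rr->blocking_request = read_page_blocking;
        return FAULT_MORE_PROCESSING_REQUIRED;
    }
    return FAULT_DONE;
}

/* The caller owns locking: it locks around each provider call, and performs
 * any requested blocking work with the lock released before retrying. */
static fault_status handle_fault(required_resources *rr)
{
    for (;;) {
        segment_lock_held = 1;
        fault_status status = fault_provider(rr);
        segment_lock_held = 0;
        if (status != FAULT_MORE_PROCESSING_REQUIRED)
            return status;
        rr->blocking_request(rr);
    }
}
```

In this model the provider is called twice for a page that is not yet resident: once to request the read and once, after the read completes, to finish the fault.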
NewMM relies on calls to trimming functions to reclaim memory, since pages are not evicted solely on the basis of leaving the address space. This necessitates a robust way of evicting pages.
Page eviction from sections is done in three stages
Stage 1: Unmapping
When a page is identified for eviction, first, we make an attempt to remove all rmaps *except* the segment rmap by unmapping the page from each address space in which it appears, and promoting the dirty bit to the segment page entry. This phase aborts if an address space contains an MM_WAIT_ENTRY. This process is safe because it doesn't change the state of the page. If the process aborts, or we decide not to continue, each process will simply reacquire the page by faulting it back. Nothing (including the actual resident data) has changed. Note that pages which are copies of CoW mappings, or are evicted swaps of CoW mappings stay and do not need to be evicted.
Stage 2: Writeback
The page is referenced once while we try to write it back. If the page is dirty, an MM_WAIT_ENTRY is written in its slot in the segment page table and it is written back. The page entry will be restored if the operation failed, including its dirty state.
Stage 3: Removal
The page entry is zeroed in the segment, and the last reference to the page is released, allowing the page to be reallocated. If any subsequent fault occurs, or a thread that previously read an MM_WAIT_ENTRY wakes up, the page will have to be faulted again from the original file.