Unleashing the Power of .NET Big Memory and Memory Mapped Files

Key Takeaways

  • Web servers often have far more memory than the .NET GC can efficiently handle under normal circumstances.
  • The performance benefits of a caching server are often lost due to increased network costs.
  • Memory Mapped Files are often the fastest way to populate a cache after a restart.
  • The goal of server-side tuning is to reach the point where your outbound network connection is saturated. This is obtained by minimizing CPU, disk, and internal network usage.
  • By keeping object graphs in memory, you can obtain the performance benefits of a graph database without the complexity.

In continuation of the Big Memory topic on the .NET platform (part 1, part 2), this article describes the benefits of working with large data sets in-process, in managed CLR server environments, using Agnicore’s Big Memory Pile.

Overview

RAM is very fast and affordable these days, yet it is ephemeral: every time the process restarts, memory is cleared out and everything has to be reloaded from scratch. To address this we have recently added Memory Mapped File support to our solution, NFX Pile. With memory mapped files, the data can be quickly fetched from disk after a restart.

Overall, the Big Memory approach is beneficial for developers and businesses, as it shifts the paradigm of high-performance computing on the .NET platform. Traditionally, Big Memory systems were built in C/C++-style languages where you primarily dealt with strings and byte arrays. But it is hard to solve real-world business problems while focusing on low-level data structures, so instead we are going to concentrate on CLR objects. Memory Pile allows developers to think in terms of object instances, and to work with hundreds of millions of instances that have properties, code, inheritance, and other CLR-native functionality.

This is different from the language-agnostic object models proposed by some vendors (e.g. ones that interoperate between Java and .NET), which introduce extra transformations, and from all of the out-of-process solutions that require extra traffic, context switching, and serialization. Instead, we’re going to discuss in-process local heaps, or rather “Piles” of objects, which exist in managed code inside large byte arrays. Individually, these objects are invisible to the GC.

Use Cases

Why would anyone use dozens or hundreds of gigabytes of RAM in the first place? Here are a few tested use cases of the Big Memory Pile technology.

The first thing that comes to mind is caching. In an e-commerce backend we store hundreds of thousands of products ready to be displayed as detailed catalog listings, each with possibly dozens of variations. When you build a catalog view listing 30+ products on a single screen, you had better get those objects quickly, even for a single user scrolling a page with progressive loading. Why not use Redis or Memcached? Because we do the same thing, only in-process, saving on network traffic and serialization; transforming data into network packets and back into objects can be a surprisingly expensive operation. Wouldn’t you use a Dictionary<id, Product> (or IMemoryCache) if it could hold all several hundred thousand products and their variations? Caching data alone provided enough motivation for using RAM, but there is much more...
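To make that idea concrete, here is a minimal sketch of the pattern; the Product type and FetchFromDatabase are hypothetical placeholders, and only the Put/Get calls reflect the actual Pile API:

// A small GC-visible index of PilePointers fronting objects that live
// in the Pile, out of the GC's reach.
private IPile m_Pile; // a started pile instance
private readonly Dictionary<int, PilePointer> m_Index = new Dictionary<int, PilePointer>();

public Product GetProduct(int id)
{
  PilePointer ptr;
  if (m_Index.TryGetValue(id, out ptr))
    return m_Pile.Get(ptr) as Product; // deserialized on demand from big memory

  var product = FetchFromDatabase(id); // hypothetical original data source
  m_Index[id] = m_Pile.Put(product);   // the instance is now invisible to the GC
  return product;
}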

In another cache use case, a REST API server, we were able to pre-serialize around 50 million rarely changing JSON vectors as UTF8-encoded byte arrays. Each byte[], around 1024 bytes, could then be served directly into the HTTP stream, making the network the bottleneck at around 80,000 req/sec.
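A rough sketch of serving such a payload, assuming an HttpListenerResponse-style response object and a jsonPtr obtained earlier from a Put():

// Raw byte[] reads bypass the serializer entirely.
var json = (byte[])pile.Get(jsonPtr);
response.ContentType = "application/json";
response.OutputStream.Write(json, 0, json.Length); // no per-request serialization cost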

Working with complex object graphs is another perfect case for Pile. In a social app, we needed to traverse conversation threads on Twitter. When tracing who said what and when on a social media site, the ability to hold hundreds of millions of small vectors in memory is invaluable. We might as well have used a graph DB; however, in our case we are the graph DB, right in the same process (it is a component hosted by our web MVC app). We’re now handling 100K+ REST API calls/sec, which is the limit of our network connection, while keeping CPU usage low.

In this and other use cases, background workers asynchronously update the social graph as changes come in. In many cases, such as the product catalog we talked about earlier, this can be done preemptively. You couldn’t do that with a normal cache that only holds a subset of the interesting data.

How it Works

Big Memory Pile solves the GC problem by transparently serializing CLR object graphs into large byte arrays, effectively “hiding” the objects from the GC’s reach. Not all object types need to be fully serialized, though: string and byte[] objects are written into the Pile as raw buffers, bypassing all serialization mechanisms and yielding over 6M inserts/second for a 64-char string on a 6-core box.
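In code, the contract boils down to Put and Get (a minimal sketch; Person stands in for any serializable CLR class, and its field name is illustrative):

// Any serializable CLR object graph can be stashed by pointer.
var person = new Person { LastName = "Smith" }; // an ordinary CLR object
PilePointer ptr = pile.Put(person);             // serialized into a large byte[] segment
var copy = pile.Get(ptr) as Person;             // returns a deserialized copy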

The key benefit of this approach is its practicality. Real-life cases have shown phenomenal overall performance while using the native CLR object model - this saves development time, because you don’t need to create special-purpose DTOs, and runs faster, as no extra in-between copies need to be made.

Overall, Pile has turned much of our I/O-bound code into CPU-bound code. What would normally be a typical case for an async (I/O-bound) implementation became 100% synchronous linear code, which is simpler and performs better, as Tasks and other async/await goodies have a hidden cost (see here and here) when doing multiple hundreds of thousands of ops/sec on a single server.

Big Memory Mapped Files

In-memory processing is fast and easy to implement; however, when the process restarts you lose the dataset, which is large by definition (tens to hundreds of gigabytes). Pulling all of that data from its original source can be very time consuming - time that you can’t afford just after a restart.

To solve this we added Memory Mapped File (MMF) support using the standard .NET classes MemoryMappedFile and MemoryMappedViewAccessor. Now, instead of using byte[] as the backing store for memory segments, we use a MemoryMappedViewAccessor instance and some low-level tricks to access data directly by pointer - all of this is still done in standard C#; no C++ is involved, as we want to keep everything simple, especially the build chain.
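For readers unfamiliar with these classes, here is a minimal, standalone sketch of the standard .NET API (this is not NFX code; the file path, map name, and capacity are arbitrary):

using System.IO;
using System.IO.MemoryMappedFiles;

// Map a 256 MB file into the process address space and access it by offset.
using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\piles\segment0.dat",
                                                 FileMode.OpenOrCreate,
                                                 "segment0",
                                                 256L * 1024 * 1024))
using (var accessor = mmf.CreateViewAccessor())
{
  accessor.Write(0, 12345);           // the OS pages this in and out as needed
  int marker = accessor.ReadInt32(0); // reads hit RAM if the page is resident
}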

Writing to memory via MemoryMappedViewAccessor (the MMFMemory class) modifies virtual memory pages directly in the OS layer. The OS tries to fit those pages in physical RAM; if it can’t, it swaps them out to disk. A nice side effect of writing the Pile into an MMF is that you don’t need to re-read everything from disk if the process restarts soon after shutdown: the OS keeps the pages that were mapped into the process address space around even AFTER the process terminates. Upon start, the MMFPile can access the pages already in RAM much more quickly than by reading from disk anew.

Do note that the MMFPile yields slower performance than the DefaultPile (based on byte[]) due to the unmanaged-code context switch done in the MMFMemory class.

Here are some test results:

Benchmark: insert 200,000,000 string[32] values, 12 threads:

(Machine: Intel Core I7 3.2 Ghz, 6 Core, Win 7 64bit, VS2017, .NET 4.5)

DefaultPile  

  • 24 sec @ 8.3 M inserts/sec = 8.5 GB memory; Full GC < 8 ms

MMFPile

  • 41 sec @ 4.9 M inserts/sec = 8.5 GB memory + disk; Full GC < 10 ms
  • Flush all data to disk on Stop(): 10 sec
  • Read all data back to RAM: 48 sec = ~177 MB/sec

As you can see, the MMF solution does have an extra cost: throughput is lower due to the unmanaged MMF transition, and once you mount the Pile back from disk, warming up the RAM with data takes time proportional to the amount of memory allocated. However, you do not need to wait for the whole working set to load back; the MMFPile is available for writes and reads immediately after Pile.Start(). The full load of all data is going to take time - in the example above, the 8.5 GB dataset takes 48 sec to warm up in RAM on a mid-grade SSD.

Benchmark: insert 200,000,000 Person objects (class with 7 fields), 12 threads:

DefaultPile  

  • 85 sec @ 2.4 M inserts/sec = 14.5 GB memory; Full GC < 10 ms

MMFPile     

  • 101 sec @ 1.9 M inserts/sec = 14.5 GB memory + disk; Full GC < 10 ms
  • Flush all data to disk on Stop(): 30 sec
  • Read all data back to RAM: 50 sec = ~290 MB/sec

Other Improvements

Since our previous post on InfoQ, we have made a number of improvements to NFX Pile:

Raw Allocator / Layered Design

The Pile implementation is now better layered, allowing us to treat string and byte[] as directly writeable/readable from the large contiguous blocks of RAM. The serialization mechanism is bypassed completely for byte[], making it possible to use Pile as a raw byte[] allocator.

var ptr = pile.Put("abcdef"); // this will bypass all serializers
                              // and use UTF8Encoding instead
var original = pile.Get(ptr) as string;

Performance Boost

The segment allocation logic has been revised and yields 50%+ better performance during inserts from multiple threads, due to the introduction of a sliding-window optimization that avoids multi-threading contention. Also, strings and byte[] now bypass the serializer completely, yielding 5M+ inserts/sec in most cases (a 200%+ improvement).

Enumeration

It is now possible to enumerate the contents of the whole pile, as it implements the IEnumerable<PileEntry> interface. Each PileEntry struct carries the pointer to a stored payload and its size:

foreach(var entry in pile)
{
  Console.WriteLine("{0} points to {1} bytes".Args(
                         entry.Pointer,
                         entry.Size));
  var data = pile.Get(entry.Pointer);
  …
}

Durable Cache

For performance reasons, the default mode for the cache is “Speculative”. In this mode, hash code collisions may cause lower-priority items to be ejected from the cache even when there is otherwise enough memory.

The cache server can now store data in a “Durable” mode, which works more like a normal dictionary. Because durable mode needs to do rehashing in the bucket, it is 5-10% slower than speculative mode. This is hardly noticeable for most applications, but you’ll need to test to see what is best for your particular situation.

// Specify TableOptions for ALL tables, making tables DURABLE
cache.DefaultTableOptions = new TableOptions("*")
{
  CollisionMode = CollisionMode.Durable
};
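Putting it together, here is a hedged sketch of configuring a durable cache at startup; the DefaultPile constructor argument mirrors the MMFPile usage shown elsewhere, and the PutPointer call follows the cache-table API, so treat the exact signatures as assumptions:

var cache = new LocalCache();
cache.Pile = new DefaultPile(cache); // or an MMFPile for disk-backed segments
cache.DefaultTableOptions = new TableOptions("*")
{
  CollisionMode = CollisionMode.Durable // every table now behaves like a dictionary
};
cache.Start();

var products = cache.GetOrCreateTable<int>("Products");
products.PutPointer(1234, cache.Pile.Put(someProduct)); // key -> pile pointer; no collision-driven ejections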

In-Place Object Mutation and Pre-allocation

It is now possible to alter objects at an existing PilePointer address. The new API Put(PilePointer...) allows placing a different payload at an existing location. If the new payload does not fit in the existing block, the Pile creates an internal link to the new location (à la a file system link in *nix systems), effectively making the original pointer an alias to the new location. Deleting the original pointer deletes the link and what it points to. The aliases are completely transparent and yield the target payload on read.

You can also pre-allocate more RAM for a future payload by specifying the preallocateBlockSize in the Put() call; see the short sketch after the linked-list example below.

// Implement a linked list stored in Pile
public class ListNode
{
  public PilePointer Previous;
  public PilePointer Next;
  public PilePointer Value;
}
...
private IPile m_Pile;        // big memory pile
private PilePointer m_First; // list head
private PilePointer m_Last;  // list tail
...
// Append a person instance to a person linked list stored in a Pile;
// returns the new last node
public PilePointer Append(Person person)
{
  var newLast = new ListNode{ Previous = m_Last,
                              Next = PilePointer.Invalid,
                              Value = m_Pile.Put(person) };

  var newLastPtr = m_Pile.Put(newLast); // add the new node to the pile to obtain its pointer

  var existingLast = (ListNode)m_Pile.Get(m_Last);
  existingLast.Next = newLastPtr;
  m_Pile.Put(m_Last, existingLast); // in-place edit at the existing ptr m_Last

  m_Last = newLastPtr;
  return m_Last;
}
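And to leave room for in-place growth, the same Put() can reserve extra space up front (the preallocateBlockSize parameter comes from the description above; the size and named-argument form are illustrative):

// Reserve 512 bytes so a later, larger in-place Put can reuse
// the block instead of creating an internal link.
var ptr = m_Pile.Put(newLast, preallocateBlockSize: 512);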

For more information see our video: .NET Big Memory Object Pile - Use 100s of millions of objects in RAM

About the Author

Dmitriy Khmaladze has over 20 years of IT experience in the US, working with startups and Fortune 500 clients (Galaxy Hosted); he pioneered SaaS for the medical industry in 1998. His 15+ years of research cover language and compiler design and distributed architecture; his background spans systems programming and architecture in C/C++, .NET, Java, Android, and iOS, web design with HTML5, CSS, and JavaScript, and RDBMSs and NoSQL/NewSQL.

Community comments

  • Load From Disk

    by Forest Snyder,

    Do I need to wait until 500 GB of data loads into memory from file? It is critical for our usage pattern to access this right after the process boots.

  • Load Persisted MMF?

    by Dan L,

    This is kind of exciting. How do you persist an MMFPile and then reload it from disk? My data directory has data stored in it, yet every time I re-run my process, the pile seems to be empty.

  • Re: Load From Disk

    by Dmitriy Khmaladze,

    Hi, Forest

    No, you do not need to wait for 500 gigabytes to load as a whole.
    What happens is: you Start() the pile and it mounts segments from disk in < 1 sec.
    However, it does not know the whole statistics as of yet - you can instantly read pointers pointing into those segments, and you can instantly delete those pointers, but you cannot write into those segments until they get crawled - analyzed by the async thread. This thread may take minutes to load your data. That's OK.

    Until it does, the new writes will go towards the end of the MMFPile.

    To summarize: you may use the MMFPile 1 sec after start. IF you need the full statistics (which most likely you do not for operation), then you wait.
    Statistics = total object count, bytes used, etc...

  • Re: Load Persisted MMF?

    by Dmitriy Khmaladze,

    Hey Dan, thanks for your question.

    The answer is this: the pile is not empty - it gets "crawled" asynchronously by a separate worker. The statistics (how many objects, bytes, yada yada...) get built as the thread reads the data into memory, BUT that does not mean that you cannot dereference stuff right away. The MMF files are handled by the OS, so if you try to do a scattered read it will work just fine right after the load. See the PileForm; run this guy to see how it works graphically using WinForms: github.com/aumcode/nfx/tree/master/Source/Testi...

    Shoot questions to dmitriyk [at] agnicore [com].

  • Re: Load Persisted MMF?

    by Dan L,

    Thanks! I got the MMFPile working, but now I'm trying to use a LocalCache with the MMFPile and my results aren't getting saved. Is the cache expected to persist when using an MMFPile?

  • Re: Load Persisted MMF?

    by Dmitriy Khmaladze,

    With LocalCache it is tricky, as it is purposed for in-memory use for fast indexing.
    On shutdown you will lose the index, but you can keep the MMFPile intact.
    What you can do is reconstruct the cache by enumerating through the pile after load, which is going to cause some delay. We have yet to release into open source our full cache server that stores keys in a balanced index in the MMFPile using a version-tolerant serializer - that code is used in a proprietary system.

  • Re: Load From Disk

    by Jonathan Allen,

    A nice feature of memory mapped files is that they are loaded on demand by the OS.

    Unless you try to iterate through the entire collection, the OS is going to pick up your data from disk one page at a time until it runs out of RAM.

    Now let's say your application crashes and restarts. Since the OS already has the file mapped into memory, there is no delay. You're not "copying" the file into your application's memory. Rather, your application is using the file/file cache as memory.

    Memory mapped files are often used for cross-process communication. If two applications map the same file to memory, they can see each other's changes. Again, this works because the file is kept in memory at the OS level. (I wonder how two Pile-based applications would handle this.)

  • Re: Load From Disk

    by Dmitriy Khmaladze,

    Regarding IPC using MMF as provided by Pile.
    Short answer: in the open-source NFX code, the MMFs mounted into a pile are for exclusive use per process - this is purposely designed this way for simplicity and speed.
    Besides, IPC in NFX is done via Glue. There is no practical need to share memory using Pile for IPC.

    The long answer:
    Pile is a memory manager, which is a thread-safe state machine. As such, it needs to synchronize access to segment buffers and the free slot pool, which are not stored in the MMF. The MMF only stores the actual data kept in Pile, not the freelists and other metadata. This is done on purpose, as syncing this state between processes would have been either prohibitive performance-wise or very complex to implement. Note, we are ONLY talking about the PARTICULAR implementation of the IPile interface as provided by NFX.

    Internally we do have a distributed "huge pile" which spans multiple machines, but it is not open as of yet as it is a part of cluster Agni OS.

  • Durability

    by Dan L,

    Dmitriy, once you write data to an MMFPile, is the write guaranteed to be persisted? Will data be lost if the machine or application crashes before the data gets flushed from memory to file?

  • Re: Durability

    by Dmitriy Khmaladze,

    There is a property on MMFPile, SyncIntervalMs, which writes the "dirty" segments to disk. HOWEVER, I need to merge it into the NFX repository by Monday.

  • Re: Load Persisted MMF?

    by Martin Strimpfl,

    Great work, Dmitriy.

    I'm trying to reconstruct the cache; however, whenever I try to get the data from the Pile, I get an exception:

    'Bad SLIM format header'

    Here is the code; the line throwing it is cache.Pile.Get(entry.Pointer):


    var cache = new LocalCache();
    cache.Pile = new MMFPile(cache) { DataDirectoryRoot = @"D:\Temp\MMF\" };
    cache.Start();

    var persons = cache.GetOrCreateTable<int>("Persons");

    foreach (var entry in cache.Pile)
    {
      var data = cache.Pile.Get(entry.Pointer);
      var person = data as Person;
      if (person != null)
      {
        persons.PutPointer(person.Id, entry.Pointer);
      }
    }

  • Re: Load Persisted MMF?

    by Dmitriy Khmaladze,

    Hi Martin,

    Add filtering on entry.Type. The enumerator returns all internal "guts", so a Where() should help:

    foreach(var entry in cache.Pile.Where( e => e.Type != PileEntry.DataType.Link))
    {....}

  • Re: Load Persisted MMF?

    by Martin Strimpfl,

    I don't think that can help. I tried to debug, and here is what I found:

    The Slim serializer uses the NFX.Serialization.Slim.TypeRegistry class to find out the type for deserialization. To deserialize the data, the SlimSerializer first reads the type's id from the memory stream and then uses this id to get the Type from the TypeRegistry. However, there is no such Type at that time, so the exception is thrown.
    To avoid it, I have to put the person object into the Pile first, so the TypeRegistry registers the Person Type (and if more types were stored previously, I need to do that in the exact order, so the TypeRegistry stores each type with the same id).

    Is there a way to register the types before the TypeRegistry is used so I can be certain of the position?

    Or am I missing something?

  • Re: Load Persisted MMF?

    by Dmitriy Khmaladze,

    Martin,
    you are not missing anything, and you did a fantastic job!
    This is my mistake - an improper merge which I did not even realize was there.

    The MMFPile writes its type registry to a file (near the data files); on start, it reads it back. This code was absent on GitHub and NuGet (we use an internal company repository and I incorrectly merged older code).

    I have just synced the internal repo and GitHub and also released a new NuGet package, so this problem is solved.

    Thanks for finding the problem!

  • MMF is a great idea !

    by Gabriel Rabhi,

    MMF is a great idea! Good job. You only have to be careful to have an integrity mechanism that validates each change with a single one-byte write, and to implement recovery code that invalidates untagged blocks. A few years ago I wrote a storage engine insensitive to dirty stops, based on writing one byte to the hard drive, and it works perfectly (you must take care of the caching mechanisms - hard drives do not write blocks in logical order, but optimize for cache and write-head displacement, and a battery backup is there to finish the cache flush). If you don't have such a thing, you can consider that a 4 KB page is written either completely or not at all. But without a final write that validates the modifications, you cannot certify that the file will never be corrupted. Furthermore, the MMF block manager does not logically manage the write-back order. You have to do a single-write flush to be sure that the validation data is persisted after the content itself. You can read the source code of the extremely good LMDB library, which is based on MMF and incredibly robust against hard stops.

  • Re: Durability

    by Zubair Mansoor,

    I could not find this property SyncIntervalMs - could you please tell me whether it is still available? I want to use the MMFPile to store data to disk so that, in case of a crash, when the application starts again it can load its state saved via the MMFPile.
