Linux handles memory allocation by creating a set of pools of memory objects of fixed sizes. Allocation requests are handled by going to a pool that holds sufficiently large objects, and handing an entire memory chunk back to the requester. The memory management scheme is quite complex, and the details of it are not normally all that interesting to device driver writers. After all, the implementation can change -- as it did in the 2.1.38 kernel -- without affecting the interface seen by the rest of the kernel.
The one thing driver developers should keep in mind, though, is that the kernel can allocate only certain predefined fixed-size byte arrays. If you ask for an arbitrary amount of memory, you're likely to get slightly more than you asked for, up to twice as much. Also, programmers should remember that the minimum amount of memory that kmalloc handles is as big as 32 or 64 bytes, depending on the page size used by the current architecture.
The data sizes available are generally powers of two. In the 2.0 kernel, the available sizes were actually slightly less than a power of two, due to control flags added by the management system. If you keep this fact in mind, you'll use memory more efficiently. For example, if you need a buffer of about 2000 bytes and run Linux 2.0, you're better off asking for 2000 bytes, rather than 2048. Requesting exactly a power of two is the worst possible case with any kernel older than 2.1.38 -- the kernel will allocate twice as much as you requested. This is why scull used 4000 bytes per quantum instead of 4096.
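For instance, under those old kernels a driver that needs about 2000 bytes per buffer would simply request that amount and let kmalloc round it up to a 2048-byte chunk. The following fragment is only a sketch of the idea; the buffer name, helper functions, and size are invented for illustration:

#include <linux/malloc.h>    /* kmalloc(), kfree(); <linux/slab.h> on later kernels */
#include <linux/errno.h>

#define MY_BUF_SIZE 2000     /* just below 2048, so one 2KB chunk suffices */

static char *my_buffer;      /* hypothetical driver-private buffer */

static int my_setup(void)
{
    my_buffer = kmalloc(MY_BUF_SIZE, GFP_KERNEL);
    if (!my_buffer)
        return -ENOMEM;
    return 0;
}

static void my_release(void)
{
    kfree(my_buffer);
    my_buffer = NULL;
}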
kmem_cache_t *kmem_cache_create(const char *name, size_t size,
        size_t offset, unsigned long flags,
        void (*constructor)(void *, kmem_cache_t *, unsigned long flags),
        void (*destructor)(void *, kmem_cache_t *, unsigned long flags));
The function creates a new cache object that can host any number of memory areas all of the same size, specified by the size argument. The name argument is associated with this cache and functions as housekeeping information usable in tracking problems; usually, it is set to the name of the type of structure that will be cached. The maximum length for the name is 20 characters, including the trailing terminator.
The offset is the offset of the first object in the page; it can be used to ensure a particular alignment for the allocated objects, but you most likely will use 0 to request the default value. flags controls how allocation is done, and is a bit mask of the following flags:
scullc is a complete example that can be used to make tests. It differs from scull only in a few lines of code. This is how it allocates memory quanta:
/* Allocate a quantum using the memory cache */
if (!dptr->data[s_pos]) {
    dptr->data[s_pos] = kmem_cache_alloc(scullc_cache, GFP_KERNEL);
    if (!dptr->data[s_pos])
        goto nomem;
    memset(dptr->data[s_pos], 0, scullc_quantum);
}
And these lines release memory:
for (i = 0; i < qset; i++)
    if (dptr->data[i])
        kmem_cache_free(scullc_cache, dptr->data[i]);
kfree(dptr->data);
To support use of scullc_cache, these few lines are included in the file at proper places:
/* declare one cache pointer: use it for all devices */
kmem_cache_t *scullc_cache;

/* init_module: create a cache for our quanta */
scullc_cache = kmem_cache_create("scullc", scullc_quantum, 0,
                    SLAB_HWCACHE_ALIGN, NULL, NULL); /* no ctor/dtor */
if (!scullc_cache) {
    result = -ENOMEM;
    goto fail_malloc2;
}

/* cleanup_module: release the cache of our quanta */
kmem_cache_destroy(scullc_cache);
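As an aside, the constructor and destructor arguments, which scullc leaves NULL, can be used to initialize objects as the slab allocator populates new slabs. Here is a minimal sketch, assuming the prototype shown earlier; the structure, cache name, and constructor are invented for illustration:

#include <linux/slab.h>
#include <linux/errno.h>

/* Hypothetical object type whose instances are cached */
struct my_item {
    int in_use;
    char payload[120];
};

static kmem_cache_t *my_cache;

/* Constructor: called when the allocator adds fresh objects to a slab */
static void my_ctor(void *obj, kmem_cache_t *cache, unsigned long flags)
{
    struct my_item *item = obj;
    item->in_use = 0;
}

/* Typically called from init_module() */
static int my_cache_setup(void)
{
    my_cache = kmem_cache_create("my_item", sizeof(struct my_item),
                        0 /* offset */, SLAB_HWCACHE_ALIGN,
                        my_ctor, NULL /* no destructor */);
    return my_cache ? 0 : -ENOMEM;
}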
Memory quanta allocated by scullp are whole pages or page sets: the scullp_order variable defaults to 0 and can be specified at either compile time or load time.
The following lines show how it allocates memory:
/* Here's the allocation of a single quantum */
if (!dptr->data[s_pos]) {
    dptr->data[s_pos] =
        (void *)__get_free_pages(GFP_KERNEL, dptr->order);
    if (!dptr->data[s_pos])
        goto nomem;
    memset(dptr->data[s_pos], 0, PAGE_SIZE << dptr->order);
}
The code to deallocate memory in scullp, instead, looks like this:
/* This code frees a whole quantum set */
for (i = 0; i < qset; i++)
    if (dptr->data[i])
        free_pages((unsigned long)(dptr->data[i]), dptr->order);
At the user level, the perceived difference is primarily a speed improvement and better memory use because there is no internal fragmentation of memory. We ran some tests copying four megabytes from scull0 to scull1 and then from scullp0 to scullp1; the results showed a slight improvement in kernel-space processor usage.
The performance improvement is not dramatic, because kmalloc is designed to be fast. The main advantage of page-level allocation isn't actually speed, but rather more efficient memory usage. Allocating by pages wastes no memory, whereas using kmalloc wastes an unpredictable amount of memory because of allocation granularity.
Memory allocated with vmalloc is released by vfree, in the same way that kfree releases memory allocated by kmalloc.
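A minimal sketch of the pairing follows; the pointer name, helper functions, and size are arbitrary:

#include <linux/vmalloc.h>   /* vmalloc(), vfree() */
#include <linux/errno.h>

static void *big_area;       /* hypothetical buffer pointer */

static int grab_area(void)
{
    /* vmalloc may sleep, so call it only from process context */
    big_area = vmalloc(64 * 1024);    /* arbitrary size for illustration */
    if (!big_area)
        return -ENOMEM;
    return 0;
}

static void drop_area(void)
{
    vfree(big_area);    /* release with vfree, never with kfree */
    big_area = NULL;
}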
Like vmalloc, ioremap builds new page tables; unlike vmalloc, however, it doesn't actually allocate any memory. The return value of ioremap is a special virtual address that can be used to access the specified physical address range; the virtual address obtained is eventually released by calling iounmap. Note that the return value from ioremap cannot be safely dereferenced on all platforms; instead, functions like readb should be used. See "Directly Mapped Memory" in Chapter 8, "Hardware Management", for the details.
It's worth noting that for the sake of portability, you should not directly access addresses returned by ioremap as if they were pointers to memory. Rather, you should always use readb and the other I/O functions introduced in "Using I/O Memory", in Chapter 8, "Hardware Management". This requirement applies because some platforms, such as the Alpha, are unable to directly map PCI memory regions to the processor address space because of differences between PCI specs and Alpha processors in how data is transferred.
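The portable access pattern, then, looks roughly like the following sketch; the physical address, length, and register offsets are placeholders standing in for a real device's I/O memory region:

#include <asm/io.h>          /* ioremap(), iounmap(), readb(), writeb() */
#include <linux/errno.h>

#define MY_PHYS_ADDR  0xe8000000UL   /* placeholder: device's I/O memory base */
#define MY_REGION_LEN 0x1000UL       /* placeholder length */

static void *my_io_base;     /* cookie returned by ioremap, not a normal pointer */

static int my_probe_region(void)
{
    unsigned char status;

    my_io_base = ioremap(MY_PHYS_ADDR, MY_REGION_LEN);
    if (!my_io_base)
        return -ENODEV;

    /* Touch the region only through the I/O accessors */
    writeb(0x01, my_io_base);
    status = readb((char *)my_io_base + 4);

    iounmap(my_io_base);
    return status;
}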
There is almost no limit to how much memory vmalloc can allocate and ioremap can make accessible, although vmalloc refuses to allocate more memory than the amount of physical RAM, in order to detect common errors or typos made by programmers. You should remember, however, that requesting too much memory with vmalloc leads to the same problems as it does with kmalloc.
The module allocates memory 16 pages at a time. The allocation is done in large chunks to achieve better performance than scullp and to demonstrate something that would take too long to be feasible with other allocation techniques. Allocating more than one page with __get_free_pages is failure prone, and even when it succeeds, it can be slow. As we saw earlier, vmalloc is faster than other functions in allocating several pages, but somewhat slower when retrieving a single page, because of the overhead of page-table building. scullv is designed like scullp. order specifies the "order" of each allocation and defaults to 4. The only difference between scullv and scullp is in allocation management. These lines use vmalloc to obtain new memory:
/* Allocate a quantum using virtual addresses */
if (!dptr->data[s_pos]) {
    dptr->data[s_pos] =
        (void *)vmalloc(PAGE_SIZE << dptr->order);
    if (!dptr->data[s_pos])
        goto nomem;
    memset(dptr->data[s_pos], 0, PAGE_SIZE << dptr->order);
}
And these lines release memory:
/* Release the quantum set */
for (i = 0; i < qset; i++)
    if (dptr->data[i])
        vfree(dptr->data[i]);
If you compile both modules with debugging enabled, you can look at their data allocation by reading the files they create in /proc. The following snapshots were taken on two different systems:
salma% cat /tmp/bigfile > /dev/scullp0; head -5 /proc/scullpmem

Device 0: qset 500, order 0, sz 1048576
       item at e00000003e641b40, qset at e000000025c60000
       0:e00000003007c000
       1:e000000024778000

salma% cat /tmp/bigfile > /dev/scullv0; head -5 /proc/scullvmem

Device 0: qset 500, order 4, sz 1048576
       item at e0000000303699c0, qset at e000000025c87000
       0:a000000000034000
       1:a000000000078000

salma% uname -m
ia64

rudo% cat /tmp/bigfile > /dev/scullp0; head -5 /proc/scullpmem

Device 0: qset 500, order 0, sz 1048576
       item at c4184780, qset at c71c4800
       0:c262b000
       1:c2193000

rudo% cat /tmp/bigfile > /dev/scullv0; head -5 /proc/scullvmem

Device 0: qset 500, order 4, sz 1048576
       item at c4184b80, qset at c71c4000
       0:c881a000
       1:c882b000

rudo% uname -m
i686
Allocation at boot time is the only way to retrieve consecutive memory pages while bypassing the limits imposed by get_free_pages on the buffer size, both in terms of maximum allowed size and limited choice of sizes. Allocating memory at boot time is a "dirty" technique, because it bypasses all memory management policies by reserving a private memory pool.
One noticeable problem with boot-time allocation is that it is not a feasible option for the average user: because it is available only to code linked into the kernel image, a device driver that uses this kind of allocation can be installed or replaced only by rebuilding the kernel and rebooting the computer. Fortunately, there are a couple of workarounds to this problem, which we introduce soon.
Even though we won't suggest allocating memory at boot time, it's something worth mentioning because it used to be the only way to allocate a DMA-capable buffer in the first Linux versions, before __GFP_DMA was introduced.
Acquiring a Dedicated Buffer at Boot Time
#include <linux/bootmem.h>

void *alloc_bootmem(unsigned long size);
void *alloc_bootmem_low(unsigned long size);
void *alloc_bootmem_pages(unsigned long size);
void *alloc_bootmem_low_pages(unsigned long size);
The functions allocate either whole pages (if they end with _pages) or non-page-aligned memory areas. They allocate either low or normal memory (see the discussion of memory zones earlier in this chapter). Normal allocation returns memory addresses that are above MAX_DMA_ADDRESS; low memory is at addresses lower than that value.
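For illustration only, a boot-time reservation might look like the following sketch; the buffer name and size are invented, and the call must come from initialization code linked into the kernel image:

#include <linux/bootmem.h>

#define MY_DMA_BUF_SIZE (128 * 1024)   /* invented size: 32 pages with 4KB pages */

static void *my_dma_buffer;            /* hypothetical buffer, never freed */

/* Must run at boot time, from code linked into the kernel image */
void my_reserve_buffer(void)
{
    /* page-aligned, low memory (below MAX_DMA_ADDRESS): usable for ISA DMA */
    my_dma_buffer = alloc_bootmem_low_pages(MY_DMA_BUF_SIZE);
}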
This interface was introduced in version 2.3.23 of the kernel. Earlier versions used a less refined interface, similar to the one described in Unix books. Basically, the initialization functions of several kernel subsystems received two unsigned long arguments, which represented the current bounds of the free memory area. Each such function could steal part of this area, returning the new lower bound. A driver allocating memory at boot time, therefore, was able to steal consecutive memory from the linear array of available RAM.
The main problem with this older mechanism of managing boot-time allocation requests was that not all initialization functions could modify the lower memory bound, so writing a driver needing such allocation usually implied providing users with a kernel patch. On the other hand, alloc_bootmem can be called by the initialization function of any kernel subsystem, provided it is performed at boot time.
This way of allocating memory has several disadvantages, not the least being the inability to ever free the buffer. After a driver has taken some memory, it has no way of returning it to the pool of free pages; the pool is created after all the physical allocation has taken place, and we don't recommend hacking the data structures internal to memory management. On the other hand, the advantage of this technique is that it makes available an area of consecutive physical memory that is suitable for DMA. This is currently the only safe way in the standard kernel to allocate a buffer of more than 32 consecutive pages, because the maximum value of order that is accepted by get_free_pages is 5. If, however, you need many pages and they don't have to be physically contiguous, vmalloc is by far the best function to use.
If you are going to resort to grabbing memory at boot time, you must modify init/main.c in the kernel sources. You'll find more about main.c in Chapter 16, "Physical Layout of the Kernel Source".
Note that this "allocation" can be performed only in multiples of the page size, though the number of pages doesn't have to be a power of two.
The bigphysarea Patch
The functions and symbols related to memory allocation follow.