patch to linux 3.0.39 (Squashed commit) · qsd8k-legacy/android_kernel_htc_htcleo@323d145

Commit

patch to linux 3.0.39 (Squashed commit)

cifs: always update the inode cache with the results from a FIND_*

commit cd60042cc1392e79410dc8de9e9c1abb38a29e57 upstream.

When we get back a FIND_FIRST/NEXT result, we have some info about the
dentry that we use to instantiate a new inode. We were ignoring and
discarding that info when we had an existing dentry in the cache.

Fix this by updating the inode in place when we find an existing dentry
and the uniqueid is the same.

Reported-and-Tested-by: Andrew Bartlett <[email protected]>
Reported-by: Bill Robertson <[email protected]>
Reported-by: Dion Edwards <[email protected]>
Signed-off-by: Jeff Layton <[email protected]>
Signed-off-by: Steve French <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

ntp: Fix STA_INS/DEL clearing bug

commit 6b1859dba01c7d512b72d77e3fd7da8354235189 upstream.

In commit 6b43ae8a619d17c4935c3320d2ef9e92bdeed05d, I
introduced a bug that kept the STA_INS or STA_DEL bit
from being cleared from time_status via adjtimex()
without forcing STA_PLL first.

Usually once the STA_INS is set, it isn't cleared
until the leap second is applied, so its unlikely this
affected anyone. However during testing I noticed it
took some effort to cancel a leap second once STA_INS
was set.

Signed-off-by: John Stultz <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Cochran <[email protected]>
Cc: Prarit Bhargava <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: fix lost kswapd wakeup in kswapd_stop()

commit 1c7e7f6c0703d03af6bcd5ccc11fc15d23e5ecbe upstream.

Offlining memory may block forever, waiting for kswapd() to wake up
because kswapd() does not check the event kthread->should_stop before
sleeping.

The proper pattern, from Documentation/memory-barriers.txt, is:

   ---  waker  ---
   event_indicated = 1;
   wake_up_process(event_daemon);

   ---  sleeper  ---
   for (;;) {
      set_current_state(TASK_UNINTERRUPTIBLE);
      if (event_indicated)
         break;
      schedule();
   }

   set_current_state() may be wrapped by:
      prepare_to_wait();

In the kswapd() case, event_indicated is kthread->should_stop.

  === offlining memory (waker) ===
   kswapd_stop()
      kthread_stop()
         kthread->should_stop = 1
         wake_up_process()
         wait_for_completion()

  ===  kswapd_try_to_sleep (sleeper) ===
   kswapd_try_to_sleep()
      prepare_to_wait()
           .
           .
      schedule()
           .
           .
      finish_wait()

The schedule() needs to be protected by a test of kthread->should_stop,
which is wrapped by kthread_should_stop().

Reproducer:
   Do heavy file I/O in background.
   Do a memory offline/online in a tight loop

Signed-off-by: Aaditya Kumar <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

MIPS: Properly align the .data..init_task section.

commit 7b1c0d26a8e272787f0f9fcc5f3e8531df3b3409 upstream.

Improper alignment can lead to unbootable systems and/or random
crashes.

[[email protected]: This is a lond standing bug since
6eb10bc9e2deab06630261cd05c4cb1e9a60e980 (kernel.org) rsp.
c422a10917f75fd19fa7fe070aaaa23e384dae6f (lmo) [MIPS: Clean up linker script
using new linker script macros.] so dates back to 2.6.32.]

Signed-off-by: David Daney <[email protected]>
Cc: [email protected]
Patchwork: https://patchwork.linux-mips.org/patch/3881/
Signed-off-by: Ralf Baechle <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

UBIFS: fix a bug in empty space fix-up

commit c6727932cfdb13501108b16c38463c09d5ec7a74 upstream.

UBIFS has a feature called "empty space fix-up" which is a quirk to work-around
limitations of dumb flasher programs. Namely, of those flashers that are unable
to skip NAND pages full of 0xFFs while flashing, resulting in empty space at
the end of half-filled eraseblocks to be unusable for UBIFS. This feature is
relatively new (introduced in v3.0).

The fix-up routine (fixup_free_space()) is executed only once at the very first
mount if the superblock has the 'space_fixup' flag set (can be done with -F
option of mkfs.ubifs). It basically reads all the UBIFS data and metadata and
writes it back to the same LEB. The routine assumes the image is pristine and
does not have anything in the journal.

There was a bug in 'fixup_free_space()' where it fixed up the log incorrectly.
All but one LEB of the log of a pristine file-system are empty. And one
contains just a commit start node. And 'fixup_free_space()' just unmapped this
LEB, which resulted in wiping the commit start node. As a result, some users
were unable to mount the file-system next time with the following symptom:

UBIFS error (pid 1): replay_log_leb: first log node at LEB 3:0 is not CS node
UBIFS error (pid 1): replay_log_leb: log error detected while replaying the log at LEB 3:0

The root-cause of this bug was that 'fixup_free_space()' wrongly assumed
that the beginning of empty space in the log head (c->lhead_offs) was known
on mount. However, it is not the case - it was always 0. UBIFS does not store
in it the master node and finds out by scanning the log on every mount.

The fix is simple - just pass commit start node size instead of 0 to
'fixup_leb()'.

Signed-off-by: Artem Bityutskiy <[email protected]>
Reported-by: Iwo Mergler <[email protected]>
Tested-by: Iwo Mergler <[email protected]>
Reported-by: James Nute <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

dm raid1: fix crash with mirror recovery and discard

commit 751f188dd5ab95b3f2b5f2f467c38aae5a2877eb upstream.

This patch fixes a crash when a discard request is sent during mirror
recovery.

Firstly, some background.  Generally, the following sequence happens during
mirror synchronization:
- function do_recovery is called
- do_recovery calls dm_rh_recovery_prepare
- dm_rh_recovery_prepare uses a semaphore to limit the number
  simultaneously recovered regions (by default the semaphore value is 1,
  so only one region at a time is recovered)
- dm_rh_recovery_prepare calls __rh_recovery_prepare,
  __rh_recovery_prepare asks the log driver for the next region to
  recover. Then, it sets the region state to DM_RH_RECOVERING. If there
  are no pending I/Os on this region, the region is added to
  quiesced_regions list. If there are pending I/Os, the region is not
  added to any list. It is added to the quiesced_regions list later (by
  dm_rh_dec function) when all I/Os finish.
- when the region is on quiesced_regions list, there are no I/Os in
  flight on this region. The region is popped from the list in
  dm_rh_recovery_start function. Then, a kcopyd job is started in the
  recover function.
- when the kcopyd job finishes, recovery_complete is called. It calls
  dm_rh_recovery_end. dm_rh_recovery_end adds the region to
  recovered_regions or failed_recovered_regions list (depending on
  whether the copy operation was successful or not).

The above mechanism assumes that if the region is in DM_RH_RECOVERING
state, no new I/Os are started on this region. When I/O is started,
dm_rh_inc_pending is called, which increases reg->pending count. When
I/O is finished, dm_rh_dec is called. It decreases reg->pending count.
If the count is zero and the region was in DM_RH_RECOVERING state,
dm_rh_dec adds it to the quiesced_regions list.

Consequently, if we call dm_rh_inc_pending/dm_rh_dec while the region is
in DM_RH_RECOVERING state, it could be added to quiesced_regions list
multiple times or it could be added to this list when kcopyd is copying
data (it is assumed that the region is not on any list while kcopyd does
its jobs). This results in memory corruption and crash.

There already exist bypasses for REQ_FLUSH requests: REQ_FLUSH requests
do not belong to any region, so they are always added to the sync list
in do_writes. dm_rh_inc_pending does not increase count for REQ_FLUSH
requests. In mirror_end_io, dm_rh_dec is never called for REQ_FLUSH
requests. These bypasses avoid the crash possibility described above.

These bypasses were improperly implemented for REQ_DISCARD when
the mirror target gained discard support in commit
5fc2ffeabb9ee0fc0e71ff16b49f34f0ed3d05b4 (dm raid1: support discard).

In do_writes, REQ_DISCARD requests is always added to the sync queue and
immediately dispatched (even if the region is in DM_RH_RECOVERING).  However,
dm_rh_inc and dm_rh_dec is called for REQ_DISCARD resusts.  So it violates the
rule that no I/Os are started on DM_RH_RECOVERING regions, and causes the list
corruption described above.

This patch changes it so that REQ_DISCARD requests follow the same path
as REQ_FLUSH. This avoids the crash.

Reference: https://bugzilla.redhat.com/837607

Signed-off-by: Mikulas Patocka <[email protected]>
Signed-off-by: Alasdair G Kergon <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm/vmstat.c: cache align vm_stat

commit a1cb2c60ddc98ff4e5246f410558805401ceee67 upstream.

Stable note: Not tracked on Bugzilla. This patch is known to make a big
        difference to tmpfs performance on larger machines.

This was found to adversely affect tmpfs I/O performance.

Tests run on a 640 cpu UV system.

With 120 threads doing parallel writes, each to different tmpfs mounts:
No patch:		~300 MB/sec
With vm_stat alignment:	~430 MB/sec

Signed-off-by: Dimitri Sivanich <[email protected]>
Acked-by: Christoph Lameter <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>

mm: memory hotplug: Check if pages are correctly reserved on a per-section basis

commit 2bbcb8788311a40714b585fc11b51da6ffa2ab92 upstream.

Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=721039 .
        Without the patch, memory hot-add can fail for kernel configurations
        that do not set CONFIG_SPARSEMEM_VMEMMAP.

(Resending as I am not seeing it in -next so maybe it got lost)

mm: memory hotplug: Check if pages are correctly reserved on a per-section basis

It is expected that memory being brought online is PageReserved
similar to what happens when the page allocator is being brought up.
Memory is onlined in "memory blocks" which consist of one or more
sections. Unfortunately, the code that verifies PageReserved is
currently assuming that the memmap backing all these pages is virtually
contiguous which is only the case when CONFIG_SPARSEMEM_VMEMMAP is set.
As a result, memory hot-add is failing on those configurations with
the message;

kernel: section number XXX page number 256 not reserved, was it already online?

This patch updates the PageReserved check to lookup struct page once
per section to guarantee the correct struct page is being checked.

[Check pages within sections properly: [email protected]]
[original patch by: [email protected]]
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Tested-by: Nathan Fontenot <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: reduce the amount of work done when updating min_free_kbytes

commit 938929f14cb595f43cd1a4e63e22d36cab1e4a1f upstream.

Stable note: Fixes https://bugzilla.novell.com/show_bug.cgi?id=726210 .
        Large machines with 1TB or more of RAM take a long time to boot
        without this patch and may spew out soft lockup warnings.

When min_free_kbytes is updated, some pageblocks are marked
MIGRATE_RESERVE.  Ordinarily, this work is unnoticable as it happens early
in boot but on large machines with 1TB of memory, this has been reported
to delay boot times, probably due to the NUMA distances involved.

The bulk of the work is due to calling calling pageblock_is_reserved() an
unnecessary amount of times and accessing far more struct page metadata
than is necessary.  This patch significantly reduces the amount of work
done by setup_zone_migrate_reserve() improving boot times on 1TB machines.

[[email protected]: coding-style fixes]
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: vmscan: fix force-scanning small targets without swap

commit a4d3e9e76337059406fcf3ead288c0df22a790e9 upstream.

vmscan: clear ZONE_CONGESTED for zone with good watermark

commit 439423f6894aa0dec22187526827456f5004baed upstream.

Stable note: Not tracked in Bugzilla. kswapd is responsible for clearing
	ZONE_CONGESTED after it balances a zone and this patch fixes a bug
	where that was failing to happen. Without this patch, processes
	can stall in wait_iff_congested unnecessarily. For users, this can
	look like an interactivity stall but some workloads would see it
	as sudden drop in throughput.

ZONE_CONGESTED is only cleared in kswapd, but pages can be freed in any
task.  It's possible ZONE_CONGESTED isn't cleared in some cases:

 1. the zone is already balanced just entering balance_pgdat() for
    order-0 because concurrent tasks free memory.  In this case, later
    check will skip the zone as it's balanced so the flag isn't cleared.

 2. high order balance fallbacks to order-0.  quote from Mel: At the
    end of balance_pgdat(), kswapd uses the following logic;

	If reclaiming at high order {
		for each zone {
			if all_unreclaimable
				skip
			if watermark is not met
				order = 0
				loop again

			/* watermark is met */
			clear congested
		}
	}

    i.e. it clears ZONE_CONGESTED if it the zone is balanced.  if not,
    it restarts balancing at order-0.  However, if the higher zones are
    balanced for order-0, kswapd will miss clearing ZONE_CONGESTED as
    that only happens after a zone is shrunk.  This can mean that
    wait_iff_congested() stalls unnecessarily.

This patch makes kswapd clear ZONE_CONGESTED during its initial
highmem->dma scan for zones that are already balanced.

Signed-off-by: Shaohua Li <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

vmscan: prepare for consistency with upstream

vmscan: add shrink_slab tracepoints

commit 095760730c1047c69159ce88021a7fa3833502c8 upstream.

Stable note: This patch makes later patches easier to apply but otherwise
        has little to justify it. It is a diagnostic patch that was part
        of a series addressing excessive slab shrinking after GFP_NOFS
        failures. There is detailed information on the series' motivation
        at https://lkml.org/lkml/2011/6/2/42 .

It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add a some tracepoints to allow
insight to be gained.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>

vmscan: shrinker->nr updates race and go wrong

commit acf92b485cccf028177f46918e045c0c4e80ee10 upstream.

Stable note: Not tracked in Bugzilla. This patch reduces excessive
	reclaim of slab objects reducing the amount of information
	that has to be brought back in from disk.

shrink_slab() allows shrinkers to be called in parallel so the
struct shrinker can be updated concurrently. It does not provide any
exclusio for such updates, so we can get the shrinker->nr value
increasing or decreasing incorrectly.

As a result, when a shrinker repeatedly returns a value of -1 (e.g.
a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
sometimes updating with the scan count that wasn't used, sometimes
losing it altogether. Worse is when a shrinker does work and that
update is lost due to racy updates, which means the shrinker will do
the work again!

Fix this by making the total_scan calculations independent of
shrinker->nr, and making the shrinker->nr updates atomic w.r.t. to
other updates via cmpxchg loops.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

vmscan: reduce wind up shrinker->nr when shrinker can't do work

commit 3567b59aa80ac4417002bf58e35dce5c777d4164 upstream.

Stable note: Not tracked in Bugzilla. This patch reduces excessive
	reclaim of slab objects reducing the amount of information that
	has to be brought back in from disk. The third and fourth paragram
	in the series describes the impact.

When a shrinker returns -1 to shrink_slab() to indicate it cannot do
any work given the current memory reclaim requirements, it adds the
entire total_scan count to shrinker->nr. The idea ehind this is that
whenteh shrinker is next called and can do work, it will do the work
of the previously aborted shrinker call as well.

However, if a filesystem is doing lots of allocation with GFP_NOFS
set, then we get many, many more aborts from the shrinkers than we
do successful calls. The result is that shrinker->nr winds up to
it's maximum permissible value (twice the current cache size) and
then when the next shrinker call that can do work is issued, it
has enough scan count built up to free the entire cache twice over.

This manifests itself in the cache going from full to empty in a
matter of seconds, even when only a small part of the cache is
needed to be emptied to free sufficient memory.

Under metadata intensive workloads on ext4 and XFS, I'm seeing the
VFS caches increase memory consumption up to 75% of memory (no page
cache pressure) over a period of 30-60s, and then the shrinker
empties them down to zero in the space of 2-3s. This cycle repeats
over and over again, with the shrinker completely trashing the inode
and dentry caches every minute or so the workload continues.

This behaviour was made obvious by the shrink_slab tracepoints added
earlier in the series, and made worse by the patch that corrected
the concurrent accounting of shrinker->nr.

To avoid this problem, stop repeated small increments of the total
scan value from winding shrinker->nr up to a value that can cause
the entire cache to be freed. We still need to allow it to wind up,
so use the delta as the "large scan" threshold check - if the delta
is more than a quarter of the entire cache size, then it is a large
scan and allowed to cause lots of windup because we are clearly
needing to free lots of memory.

If it isn't a large scan then limit the total scan to half the size
of the cache so that windup never increases to consume the whole
cache. Reducing the total scan limit further does not allow enough
wind-up to maintain the current levels of performance, whilst a
higher threshold does not prevent the windup from freeing the entire
cache under sustained workloads.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

vmscan: limit direct reclaim for higher order allocations

commit e0887c19b2daa140f20ca8104bdc5740f39dbb86 upstream.

Stable note: Not tracked on Bugzilla. THP and compaction was found to
	aggressively reclaim pages and stall systems under different
	situations that was addressed piecemeal over time.  Paragraph
	3 of this changelog is the motivation for this patch.

When suffering from memory fragmentation due to unfreeable pages, THP page
faults will repeatedly try to compact memory.  Due to the unfreeable
pages, compaction fails.

Needless to say, at that point page reclaim also fails to create free
contiguous 2MB areas.  However, that doesn't stop the current code from
trying, over and over again, and freeing a minimum of 4MB (2UL <<
sc->order pages) at every single invocation.

This resulted in my 12GB system having 2-3GB free memory, a corresponding
amount of used swap and very sluggish response times.

This can be avoided by having the direct reclaim code not reclaim from
zones that already have plenty of free memory available for compaction.

If compaction still fails due to unmovable memory, doing additional
reclaim will only hurt the system, not help.

[[email protected]: change comment to explain the order check]
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

vmscan: abort reclaim/compaction if compaction can proceed

commit e0c23279c9f800c403f37511484d9014ac83adec upstream.

Stable note: Not tracked on Bugzilla. THP and compaction was found to
	aggressively reclaim pages and stall systems under different
	situations that was addressed piecemeal over time.

If compaction can proceed, shrink_zones() stops doing any work but its
callers still call shrink_slab() which raises the priority and potentially
sleeps.  This is unnecessary and wasteful so this patch aborts direct
reclaim/compaction entirely if compaction can proceed.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Josh Boyer <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: change isolate mode from #define to bitwise type

commit 4356f21d09283dc6d39a6f7287a65ddab61e2808 upstream.

Stable note: Not tracked in Bugzilla. This patch makes later patches
	easier to apply but has no other impact.

Change ISOLATE_XXX macro with bitwise isolate_mode_t type.  Normally,
macro isn't recommended as it's type-unsafe and making debugging harder as
symbol cannot be passed throught to the debugger.

Quote from Johannes
" Hmm, it would probably be cleaner to fully convert the isolation mode
into independent flags.  INACTIVE, ACTIVE, BOTH is currently a
tri-state among flags, which is a bit ugly."

This patch moves isolate mode from swap.h to mmzone.h by memcontrol.h

Signed-off-by: Minchan Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: compaction: make isolate_lru_page() filter-aware

commit 39deaf8585152f1a35c1676d3d7dc6ae0fb65967 upstream.

Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU
	list leading to poor reclaim decisions which has a variable
	performance impact.

In async mode, compaction doesn't migrate dirty or writeback pages.  So,
it's meaningless to pick the page and re-add it to lru list.

Of course, when we isolate the page in compaction, the page might be dirty
or writeback but when we try to migrate the page, the page would be not
dirty, writeback.  So it could be migrated.  But it's very unlikely as
isolate and migration cycle is much faster than writeout.

So, this patch helps cpu overhead and prevent unnecessary LRU churning.

Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: zone_reclaim: make isolate_lru_page() filter-aware

commit f80c0673610e36ae29d63e3297175e22f70dde5f upstream.

Stable note: Not tracked in Bugzilla. THP and compaction disrupt the LRU list
	leading to poor reclaim decisions which has a variable
	performance impact.

In __zone_reclaim case, we don't want to shrink mapped page.  Nonetheless,
we have isolated mapped page and re-add it into LRU's head.  It's
unnecessary CPU overhead and makes LRU churning.

Of course, when we isolate the page, the page might be mapped but when we
try to migrate the page, the page would be not mapped.  So it could be
migrated.  But race is rare and although it happens, it's no big deal.

Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: migration: clean up unmap_and_move()

commit 0dabec93de633a87adfbbe1d800a4c56cd19d73b upstream.

Stable note: Not tracked in Bugzilla. This patch makes later patches
	easier to apply but has no other impact.

unmap_and_move() is one a big messy function.  Clean it up.

Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: compaction: allow compaction to isolate dirty pages

commit a77ebd333cd810d7b680d544be88c875131c2bd3 upstream.

mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage

commit b969c4ab9f182a6e1b2a0848be349f99714947b0 upstream.

Stable note: Not tracked in Bugzilla. A fix aimed at preserving page
	aging information by reducing LRU list churning had the side-effect
	of reducing THP allocation success rates. This was part of a series
	to restore the success rates while preserving the reclaim fix.

Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time.  Due to reports of stalling,
there was a debate on disabling synchronous compaction but this severely
impacted allocation success rates.  Part of the reason was that many dirty
pages are skipped in asynchronous compaction by the following check;

	if (PageDirty(page) && !sync &&
		mapping->a_ops->migratepage != migrate_page)
			rc = -EBUSY;

This skips over all mapping aops using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking.  This
patch updates the ->migratepage callback with a "sync" parameter.  It is
the responsibility of the callback to fail gracefully if migration would
block.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andy Isaacson <[email protected]>
Cc: Nai Xia <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred

commit 66199712e9eef5aede09dbcd9dfff87798a66917 upstream.

Stable note: Not tracked in Buzilla. This was part of a series that
	reduced interactivity stalls experienced when THP was enabled.

If compaction is deferred, direct reclaim is used to try to free enough
pages for the allocation to succeed.  For small high-orders, this has a
reasonable chance of success.  However, if the caller has specified
__GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
to fail the allocation rather than stall the caller in direct reclaim.
This patch skips direct reclaim if compaction is deferred and the caller
specifies __GFP_NO_KSWAPD.

Async compaction only considers a subset of pages so it is possible for
compaction to be deferred prematurely and not enter direct reclaim even in
cases where it should.  To compensate for this, this patch also defers
compaction only if sync compaction failed.

Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Reviewed-by: Rik van Riel<[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andy Isaacson <[email protected]>
Cc: Nai Xia <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: compaction: make isolate_lru_page() filter-aware again

commit c82449352854ff09e43062246af86bdeb628f0c3 upstream.

Stable note: Not tracked in Bugzilla. A fix aimed at preserving page aging
	information by reducing LRU list churning had the side-effect of
	reducing THP allocation success rates. This was part of a series
	to restore the success rates while preserving the reclaim fix.

Commit 39deaf85 ("mm: compaction: make isolate_lru_page() filter-aware")
noted that compaction does not migrate dirty or writeback pages and that
is was meaningless to pick the page and re-add it to the LRU list.  This
had to be partially reverted because some dirty pages can be migrated by
compaction without blocking.

This patch updates "mm: compaction: make isolate_lru_page" by skipping
over pages that migration has no possibility of migrating to minimise LRU
disruption.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel<[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andy Isaacson <[email protected]>
Cc: Nai Xia <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

kswapd: avoid unnecessary rebalance after an unsuccessful balancing

commit d2ebd0f6b89567eb93ead4e2ca0cbe03021f344b upstream.

Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019.  This
	patch reduces kswapd CPU usage.

In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing.  But in the following scenario, kswapd do something that
is not matched our expectation.  The patch fixes this issue.

1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
   While the expectation behavior of kswapd should try to sleep.

Signed-off-by: Alex Shi <[email protected]>
Reviewed-by: Tim Chen <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Tested-by: PÃ¡draig Brady <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

kswapd: assign new_order and new_classzone_idx after wakeup in sleeping

commit f0dfcde099453aa4c0dc42473828d15a6d492936 upstream.

Stable note: Fixes https://bugzilla.redhat.com/show_bug.cgi?id=712019.  This
	patch reduces kswapd CPU usage.

There 2 places to read pgdat in kswapd.  One is return from a successful
balance, another is waked up from kswapd sleeping.  The new_order and
new_classzone_idx represent the balance input order and classzone_idx.

But current new_order and new_classzone_idx are not assigned after
kswapd_try_to_sleep(), that will cause a bug in the following scenario.

1: after a successful balance, kswapd goes to sleep, and new_order = 0;
   new_classzone_idx = __MAX_NR_ZONES - 1;

2: kswapd waked up with order = 3 and classzone_idx = ZONE_NORMAL

3: in the balance_pgdat() running, a new balance wakeup happened with
   order = 5, and classzone_idx = ZONE_NORMAL

4: the first wakeup(order = 3) finished successufly, return order = 3
   but, the new_order is still 0, so, this balancing will be treated as a
   failed balance.  And then the second tighter balancing will be missed.

So, to avoid the above problem, the new_order and new_classzone_idx need
to be assigned for later successful comparison.

Signed-off-by: Alex Shi <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Tested-by: PÃ¡draig Brady <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: compaction: introduce sync-light migration for use by compaction

commit a6bc32b899223a877f595ef9ddc1e89ead5072b8 upstream.

Stable note: Not tracked in Buzilla. This was part of a series that
	reduced interactivity stalls experienced when THP was enabled.
	These stalls were particularly noticable when copying data
	to a USB stick but the experiences for users varied a lot.

This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage.  Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.

This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.

[[email protected]: This patch is heavily based on Andrea's work]
[[email protected]: fix fs/nfs/write.c build]
[[email protected]: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andy Isaacson <[email protected]>
Cc: Nai Xia <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: vmscan: when reclaiming for compaction, ensure there are sufficient free pages available

commit fe4b1b244bdb96136855f2c694071cb09d140766 upstream.

Stable note: Not tracked on Bugzilla. THP and compaction was found to
	aggressively reclaim pages and stall systems under different
	situations that was addressed piecemeal over time. This patch
	addresses a problem where the fix regressed THP allocation
	success rates.

In commit e0887c19 ("vmscan: limit direct reclaim for higher order
allocations"), Rik noted that reclaim was too aggressive when THP was
enabled.  In his initial patch he used the number of free pages to decide
if reclaim should abort for compaction.  My feedback was that reclaim and
compaction should be using the same logic when deciding if reclaim should
be aborted.

Unfortunately, this had the effect of reducing THP success rates when the
workload included something like streaming reads that continually
allocated pages.  The window during which compaction could run and return
a THP was too small.

This patch combines Rik's two patches together.  compaction_suitable() is
still used to decide if reclaim should be aborted to allow compaction is
used.  However, it will also ensure that there is a reasonable buffer of
free pages available.  This improves upon the THP allocation success rates
but bounds the number of pages that are freed for compaction.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel<[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andy Isaacson <[email protected]>
Cc: Nai Xia <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: vmscan: do not OOM if aborting reclaim to start compaction

commit 7335084d446b83cbcb15da80497d03f0c1dc9e21 upstream.

Stable note: Not tracked in Bugzilla. This patch makes later patches
	easier to apply but otherwise has little to justify it. The
	problem it fixes was never observed but the source of the
	theoretical problem did not exist for very long.

During direct reclaim it is possible that reclaim will be aborted so that
compaction can be attempted to satisfy a high-order allocation.  If this
decision is made before any pages are reclaimed, it is possible that 0 is
returned to the page allocator potentially triggering an OOM.  This has
not been observed but it is a possibility so this patch addresses it.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andy Isaacson <[email protected]>
Cc: Nai Xia <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: vmscan: check if reclaim should really abort even if compaction_ready() is true for one zone

commit 0cee34fd72c582b4f8ad8ce00645b75fb4168199 upstream.

Stable note: Not tracked on Bugzilla. THP and compaction was found to
	aggressively reclaim pages and stall systems under different
	situations that was addressed piecemeal over time.

If compaction can proceed for a given zone, shrink_zones() does not
reclaim any more pages from it.  After commit [e0c2327: vmscan: abort
reclaim/compaction if compaction can proceed], do_try_to_free_pages()
tries to finish as soon as possible once one zone can compact.

This was intended to prevent slabs being shrunk unnecessarily but there
are side-effects.  One is that a small zone that is ready for compaction
will abort reclaim even if the chances of successfully allocating a THP
from that zone is small.  It also means that reclaim can return too early
even though sc->nr_to_reclaim pages were not reclaimed.

This partially reverts the commit until it is proven that slabs are really
being shrunk unnecessarily but preserves the check to return 1 to avoid
OOM if reclaim was aborted prematurely.

[[email protected]: This patch replaces a revert from Andrea]
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andy Isaacson <[email protected]>
Cc: Nai Xia <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

vmscan: promote shared file mapped pages

commit 34dbc67a644f11ab3475d822d72e25409911e760 upstream.

Stable note: Not tracked in Bugzilla. There were reports of shared
	mapped pages being unfairly reclaimed in comparison to older kernels.
	This is being addressed over time. The specific workload being
	addressed here in described in paragraph four and while paragraph
	five says it did not help performance as such, it made a difference
	to major page faults. I'm aware of at least one bug for a large
	vendor that was due to increased major faults.

Commit 645747462435 ("vmscan: detect mapped file pages used only once")
greatly decreases lifetime of single-used mapped file pages.
Unfortunately it also decreases life time of all shared mapped file
pages.  Because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed
in fault path") page-fault handler does not mark page active or even
referenced.

Thus page_check_references() activates file page only if it was used twice
while it stays in inactive list, meanwhile it activates anon pages after
first access.  Inactive list can be small enough, this way reclaimer can
accidentally throw away any widely used page if it wasn't used twice in
short period.

After this patch page_check_references() also activate file mapped page at
first inactive list scan if this page is already used multiple times via
several ptes.

I found this while trying to fix degragation in rhel6 (~2.6.32) from rhel5
(~2.6.18).  There a complete mess with >100 web/mail/spam/ftp containers,
they share all their files but there a lot of anonymous pages: ~500mb
shared file mapped memory and 15-20Gb non-shared anonymous memory.  In
this situation major-pagefaults are very costly, because all containers
share the same page.  In my load kernel created a disproportionate
pressure on the file memory, compared with the anonymous, they equaled
only if I raise swappiness up to 150 =)

These patches actually wasn't helped a lot in my problem, but I saw
noticable (10-20 times) reduce in count and average time of
major-pagefault in file-mapped areas.

Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
it was aimed at one scenario (singly used pages), but it breaks the logic
in other scenarios (shared and/or executable pages)

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Acked-by: Pekka Enberg <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>

vmscan: activate executable pages after first usage

commit c909e99364c8b6ca07864d752950b6b4ecf6bef4 upstream.

Stable note: Not tracked in Bugzilla. There were reports of shared
	mapped pages being unfairly reclaimed in comparison to older kernels.
	This is being addressed over time.

Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
the first class citizen") was noticeably weakened in commit
645747462435d84 ("vmscan: detect mapped file pages used only once").

Currently these pages can become "first class citizens" only after second
usage.  After this patch page_check_references() will activate they after
first usage, and executable code gets yet better chance to stay in memory.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm/vmscan.c: consider swap space when deciding whether to continue reclaim

commit 86cfd3a45042ab242d47f3935a02811a402beab6 upstream.

Stable note: Not tracked in Bugzilla. This patch reduces kswapd CPU
	usage on swapless systems with high anonymous memory usage.

It's pointless to continue reclaiming when we have no swap space and lots
of anon pages in the inactive list.

Without this patch, it is possible when swap is disabled to continue
trying to reclaim when there are only anonymous pages in the system even
though that will not make any progress.

Signed-off-by: Minchan Kim <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm: test PageSwapBacked in lumpy reclaim

commit 043bcbe5ec51e0478ef2b44acef17193e01d7f70 upstream.

Stable note: Not tracked in Bugzilla. There were reports of shared
	mapped pages being unfairly reclaimed in comparison to older kernels.
	This is being addressed over time. Even though the subject
	refers to lumpy reclaim, it impacts compaction as well.

Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.

Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>

mm: vmscan: convert global reclaim to per-memcg LRU lists

commit b95a2f2d486d0d768a92879c023a03757b9c7e58 upstream - WARNING: this is a substitute patch.

Stable note: Not tracked in Bugzilla. This is a partial backport of an
	upstream commit addressing a completely different issue
	that accidentally contained an important fix. The workload
	this patch helps was memcached when IO is started in the
	background. memcached should stay resident but without this patch
	it gets swapped. Sometimes this manifests as a drop in throughput
	but mostly it was observed through /proc/vmstat.

Commit [246e87a9: memcg: fix get_scan_count() for small targets] was meant
to fix a problem whereby small scan targets on memcg were ignored causing
priority to raise too sharply. It forced scanning to take place if the
target was small, memcg or kswapd.

From the time it was introduced it caused excessive reclaim by kswapd
with workloads being pushed to swap that previously would have stayed
resident. This was accidentally fixed in commit [b95a2f2d: mm: vmscan:
convert global reclaim to per-memcg LRU lists] by making it harder for
kswapd to force scan small targets but that patchset is not suitable for
backporting. This was later changed again by commit [90126375: mm/vmscan:
push lruvec pointer into get_scan_count()] into a format that looks
like it would be a straight-forward backport but there is a subtle
difference due to the use of lruvecs.

The impact of the accidental fix is to make it harder for kswapd to force
scan small targets by taking zone->all_unreclaimable into account. This
patch is the closest equivalent available based on what is backported.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

cpusets: avoid looping when storing to mems_allowed if one node remains set

commit 89e8a244b97e48f1f30e898b6f32acca477f2a13 upstream.

Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is
	extremely expensive and severely impacted page allocator performance.
	This is part of a series of patches that reduce page allocator
	overhead.

{get,put}_mems_allowed() exist so that general kernel code may locklessly
access a task's set of allowable nodes without having the chance that a
concurrent write will cause the nodemask to be empty on configurations
where MAX_NUMNODES > BITS_PER_LONG.

This could incur a significant delay, however, especially in low memory
conditions because the page allocator is blocking and reclaim requires
get_mems_allowed() itself.  It is not atypical to see writes to
cpuset.mems take over 2 seconds to complete, for example.  In low memory
conditions, this is problematic because it's one of the most imporant
times to change cpuset.mems in the first place!

The only way a task's set of allowable nodes may change is through cpusets
by writing to cpuset.mems and when attaching a task to a generic code is
not reading the nodemask with get_mems_allowed() at the same time, and
then clearing all the old nodes.  This prevents the possibility that a
reader will see an empty nodemask at the same time the writer is storing a
new nodemask.

If at least one node remains unchanged, though, it's possible to simply
set all new nodes and then clear all the old nodes.  Changing a task's
nodemask is protected by cgroup_mutex so it's guaranteed that two threads
are not changing the same task's nodemask at the same time, so the
nodemask is guaranteed to be stored before another thread changes it and
determines whether a node remains set or not.

Signed-off-by: David Rientjes <[email protected]>
Cc: Miao Xie <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Paul Menage <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask

commit b246272ecc5ac68c743b15c9e41a2275f7ce70e2 upstream.

Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
	expensive and severely impacted page allocator performance. This is
	part of a series of patches that reduce page allocator overhead.

Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
new set of allowed cpuset nodes where the two nodemasks, as a result of
the remap, are now disjoint.

c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread.  This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.

This stall is unncessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes.  This was addressed by
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind.  To prevent this, only avoid the stall if there is
no mempolicy for the thread being changed.

This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.

Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change.  This ensures that nothing can set this
tsk's mems_allowed out from under us and also protects tsk->mempolicy.

Reported-by: Miao Xie <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Paul Menage <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

cpuset: mm: reduce large amounts of memory barrier related damage v3

commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream.

Stable note:  Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
	expensive and severely impacted page allocator performance. This
	is part of a series of patches that reduce page allocator overhead.

Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.

[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths.  This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32.  The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.

For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.

This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side.  This is much cheaper on some architectures, including x86.  The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.

While updating the nodemask, a check is made to see if a false failure
is a risk.  If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.

In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
actual results were

                             3.3.0-rc3          3.3.0-rc3
                             rc3-vanilla        nobarrier-v2r1
    Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
    Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
    Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
    Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
    Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
    Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
    Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
    Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
    Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
    Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
    Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
    Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
    Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
    Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
    Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds)             135.68    132.17
    User+Sys Time Running Test (seconds)         164.2    160.13
    Total Elapsed Time (seconds)                123.46    120.87

The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected).  The
actual number of page faults is noticeably improved.

For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.

To test the actual bug the commit fixed I opened two terminals.  The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data.  In a second window, the nodemask of the
cpuset was continually randomised in a loop.

Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.

Signed-off-by: Mel Gorman <[email protected]>
Cc: Miao Xie <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

mm/hugetlb: fix warning in alloc_huge_page/dequeue_huge_page_vma

commit b1c12cbcd0a02527c180a862e8971e249d3b347d upstream.

Stable note: Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
	expensive and severely impacted page allocator performance. This
	is part of a series of patches that reduce page allocator overhead.

Fix a gcc warning (and bug?) introduced in cc9a6c877 ("cpuset: mm: reduce
large amounts of memory barrier related damage v3")

Local variable "page" can be uninitialized if the nodemask from vma policy
does not intersects with nodemask from cpuset.  Even if it doesn't happens
it is better to initialize this variable explicitly than to introduce
a kernel oops in a weird corner case.

mm/hugetlb.c: In function `alloc_huge_page':
mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

vmscan: fix initial shrinker size handling

commit 635697c663f38106063d5659f0cf2e45afcd4bb5 upstream.

Stable note: The commit [acf92b48: vmscan: shrinker->nr updates race and
	go wrong] aimed to reduce excessive reclaim of slab objects but
	had bug in how it treated shrinker functions that returned -1.

A shrinker function can return -1, means that it cannot do anything
without a risk of deadlock.  For example prune_super() does this if it
cannot grab a superblock refrence, even if nr_to_scan=0.  Currently we
interpret this -1 as a ULONG_MAX size shrinker and evaluate `total_scan'
according to this.  So the next time around this shrinker can cause
really big pressure.  Let's skip such shrinkers instead.

Also make total_scan signed, otherwise the check (total_scan < 0) below
never works.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Cc: Dave Chinner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>

Linux 3.0.39

Loading branch information

jtlayton authored and synergydev committed Aug 3, 2012

1 parent 3b745d7 commit 323d145

Documentation/trace/postprocess/trace-vmscan-postprocess.pl

            
                      Original file line number
                      Diff line number
                      Diff line change
                  
    @@ -379,10 +379,10 @@ sub process_events {
  
    			# To closer match vmstat scanning statistics, only count isolate_both

    			# and isolate_inactive as scanning. isolate_active is rotation

    			# isolate_inactive == 0

    			# isolate_active   == 1

    			# isolate_both     == 2

    			if ($isolate_mode != 1) {

    			# isolate_inactive == 1

    			# isolate_active   == 2

    			# isolate_both     == 3

    			if ($isolate_mode != 2) {

    				$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;

    			}

    			$perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;

Makefile

-Original file line number
+Diff line change
@@ -1,6 +1,6 @@
     VERSION = 3
     PATCHLEVEL = 0
-    SUBLEVEL = 38
+    SUBLEVEL = 39
     EXTRAVERSION =
     NAME = Sneaky Weasel
@@ Expand Down @@

arch/mips/include/asm/thread_info.h

-Original file line number
+Diff line change
@@ Expand Up / @@ -60,6 +60,8 @@ struct thread_info { @@
     register struct thread_info *__current_thread_info __asm__("$28");
     #define current_thread_info()  __current_thread_info
+    #endif /* !__ASSEMBLY__ */
     /* thread information allocation */
     #if defined(CONFIG_PAGE_SIZE_4KB) && defined(CONFIG_32BIT)
     #define THREAD_SIZE_ORDER (1)
@@ Expand Down Expand Up @@
     #define free_thread_info(info) kfree(info)
-    #endif /* !__ASSEMBLY__ */
     #define PREEMPT_ACTIVE		0x10000000
     /*
@@ Expand Down @@

arch/mips/kernel/vmlinux.lds.S

-Original file line number
+Diff line change
@@ -1,5 +1,6 @@
     #include <asm/asm-offsets.h>
     #include <asm/page.h>
+    #include <asm/thread_info.h>
     #include <asm-generic/vmlinux.lds.h>
     #undef mips
@@ Expand Down Expand Up / @@ -73,7 +74,7 @@ SECTIONS @@
     	.data : {	/* Data */
     		. = . + DATAOFFSET;		/* for CONFIG_MAPPED_KERNEL */
-    		INIT_TASK_DATA(PAGE_SIZE)
+    		INIT_TASK_DATA(THREAD_SIZE)
     		NOSAVE_DATA
     		CACHELINE_ALIGNED_DATA(1 << CONFIG_MIPS_L1_CACHE_SHIFT)
     		READ_MOSTLY_DATA(1 << CONFIG_MIPS_L1_CACHE_SHIFT)
@@ Expand Down @@

drivers/base/memory.c

-Original file line number
+Diff line change
@@ Expand Up / @@ -223,41 +223,63 @@ int memory_isolate_notify(unsigned long val, void *v) @@
     	return atomic_notifier_call_chain(&memory_isolate_chain, val, v);
     }
+    /*
+     * The probe routines leave the pages reserved, just as the bootmem code does.
+     * Make sure they're still that way.
+     */
+    static bool pages_correctly_reserved(unsigned long start_pfn,
+    					unsigned long nr_pages)
+    {
+    	int i, j;
+    	struct page *page;
+    	unsigned long pfn = start_pfn;
+    	/*
+    	 * memmap between sections is not contiguous except with
+    	 * SPARSEMEM_VMEMMAP. We lookup the page once per section
+    	 * and assume memmap is contiguous within each section
+    	 */
+    	for (i = 0; i < sections_per_block; i++, pfn += PAGES_PER_SECTION) {
+    		if (WARN_ON_ONCE(!pfn_valid(pfn)))
+    			return false;
+    		page = pfn_to_page(pfn);
+    		for (j = 0; j < PAGES_PER_SECTION; j++) {
+    			if (PageReserved(page + j))
+    				continue;
+    			printk(KERN_WARNING "section number %ld page number %d "
+    				"not reserved, was it already online?\n",
+    				pfn_to_section_nr(pfn), j);
+    			return false;
+    		}
+    	}
+    	return true;
+    }
     /*
      * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
      * OK to have direct references to sparsemem variables in here.
      */
     static int
     memory_block_action(unsigned long phys_index, unsigned long action)
     {
-    	int i;
     	unsigned long start_pfn, start_paddr;
     	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
     	struct page *first_page;
     	int ret;
     	first_page = pfn_to_page(phys_index << PFN_SECTION_SHIFT);
-    	/*
-    	 * The probe routines leave the pages reserved, just
-    	 * as the bootmem code does.  Make sure they're still
-    	 * that way.
-    	 */
-    	if (action == MEM_ONLINE) {
-    		for (i = 0; i < nr_pages; i++) {
-    			if (PageReserved(first_page+i))
-    				continue;
-    			printk(KERN_WARNING "section number %ld page number %d "
-    				"not reserved, was it already online?\n",
-    				phys_index, i);
-    			return -EBUSY;
-    		}
-    	}
     	switch (action) {
     		case MEM_ONLINE:
     			start_pfn = page_to_pfn(first_page);
+    			if (!pages_correctly_reserved(start_pfn, nr_pages))
+    				return -EBUSY;
     			ret = online_pages(start_pfn, nr_pages);
     			break;
     		case MEM_OFFLINE:
@@ Expand Down @@

drivers/md/dm-raid1.c

            
                      Original file line number
                      Diff line number
                      Diff line change
                  
    @@ -1210,7 +1210,7 @@ static int mirror_end_io(struct dm_target *ti, struct bio *bio,
  
    	 * We need to dec pending if this was a write.

    	 */

    	if (rw == WRITE) {

    		if (!(bio->bi_rw & REQ_FLUSH))

    		if (!(bio->bi_rw & (REQ_FLUSH | REQ_DISCARD)))

    			dm_rh_dec(ms->rh, map_context->ll);

    		return error;

    	}

drivers/md/dm-region-hash.c

            
                      Original file line number
                      Diff line number
                      Diff line change
                  
    @@ -404,6 +404,9 @@ void dm_rh_mark_nosync(struct dm_region_hash *rh, struct bio *bio)
  
    		return;

    	}

    	if (bio->bi_rw & REQ_DISCARD)

    		return;

    	/* We must inform the log that the sync count has changed. */

    	log->type->set_region_sync(log, region, 0);

    @@ -524,7 +527,7 @@ void dm_rh_inc_pending(struct dm_region_hash *rh, struct bio_list *bios)
  
    	struct bio *bio;

    	for (bio = bios->head; bio; bio = bio->bi_next) {

    		if (bio->bi_rw & REQ_FLUSH)

    		if (bio->bi_rw & (REQ_FLUSH | REQ_DISCARD))

    			continue;

    		rh_inc(rh, dm_rh_bio_to_region(rh, bio));

    	}

fs/btrfs/disk-io.c

-Original file line number
+Diff line change
@@ Expand Up @@
     #ifdef CONFIG_MIGRATION
     static int btree_migratepage(struct address_space *mapping,
-    			struct page *newpage, struct page *page)
+    			struct page *newpage, struct page *page,
+    			enum migrate_mode mode)
     {
     	/*
     	 * we can't safely write a btree page from here,
@@ Expand All / @@ -816,7 +817,7 @@ static int btree_migratepage(struct address_space *mapping, @@
     	if (page_has_private(page) &&
     	    !try_to_release_page(page, GFP_KERNEL))
     		return -EAGAIN;
-    	return migrate_page(mapping, newpage, page);
+    	return migrate_page(mapping, newpage, page, mode);
     }
     #endif
@@ Expand Down @@

fs/cifs/readdir.c

-Original file line number
+Diff line change
@@ Expand Up @@
     	dentry = d_lookup(parent, name);
     	if (dentry) {
-    		/* FIXME: check for inode number changes? */
-    		if (dentry->d_inode != NULL)
+    		inode = dentry->d_inode;
+    		/* update inode in place if i_ino didn't change */
+    		if (inode && CIFS_I(inode)->uniqueid == fattr->cf_uniqueid) {
+    			cifs_fattr_to_inode(inode, fattr);
     			return dentry;
+    		}
     		d_drop(dentry);
     		dput(dentry);
     	}
@@ Expand Down @@

fs/hugetlbfs/inode.c

-Original file line number
+Diff line change
@@ Expand Up / @@ -568,7 +568,8 @@ static int hugetlbfs_set_page_dirty(struct page *page) @@
     }
     static int hugetlbfs_migrate_page(struct address_space *mapping,
-    				struct page *newpage, struct page *page)
+    				struct page *newpage, struct page *page,
+    				enum migrate_mode mode)
     {
     	int rc;
@@ Expand Down @@

fs/nfs/internal.h

            
                      Original file line number
                      Diff line number
                      Diff line change
                  
    @@ -315,7 +315,7 @@ void nfs_commit_release_pages(struct nfs_write_data *data);
  
    #ifdef CONFIG_MIGRATION

    extern int nfs_migrate_page(struct address_space *,

    		struct page *, struct page *);

    		struct page *, struct page *, enum migrate_mode);

    #else

    #define nfs_migrate_page NULL

    #endif

fs/nfs/write.c

            
                      Original file line number
                      Diff line number
                      Diff line change
                  
    @@ -1662,7 +1662,7 @@ int nfs_wb_page(struct inode *inode, struct page *page)
  
    #ifdef CONFIG_MIGRATION

    int nfs_migrate_page(struct address_space *mapping, struct page *newpage,

    		struct page *page)

    		struct page *page, enum migrate_mode mode)

    {

    	/*

    	 * If PagePrivate is set, then the page is currently associated with

    @@ -1677,7 +1677,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
  
    	nfs_fscache_release_page(page, GFP_KERNEL);

    	return migrate_page(mapping, newpage, page);

    	return migrate_page(mapping, newpage, page, mode);

    }

    #endif

fs/ubifs/sb.c

-Original file line number
+Diff line change
@@ Expand Up / @@ -715,8 +715,12 @@ static int fixup_free_space(struct ubifs_info *c) @@
     		lnum = ubifs_next_log_lnum(c, lnum);
     	}
-    	/* Fixup the current log head */
-    	err = fixup_leb(c, c->lhead_lnum, c->lhead_offs);
+    	/*
+    	 * Fixup the log head which contains the only a CS node at the
+    	 * beginning.
+    	 */
+    	err = fixup_leb(c, c->lhead_lnum,
+    			ALIGN(UBIFS_CS_NODE_SZ, c->min_io_size));
     	if (err)
     		goto out;
@@ Expand Down @@

include/linux/cpuset.h

            
                      Original file line number
                      Diff line number
                      Diff line change
                  
    @@ -89,42 +89,33 @@ extern void rebuild_sched_domains(void);
  
    extern void cpuset_print_task_mems_allowed(struct task_struct *p);

    /*

     * reading current mems_allowed and mempolicy in the fastpath must protected

     * by get_mems_allowed()

     * get_mems_allowed is required when making decisions involving mems_allowed

     * such as during page allocation. mems_allowed can be updated in parallel

     * and depending on the new value an operation can fail potentially causing

     * process failure. A retry loop with get_mems_allowed and put_mems_allowed

     * prevents these artificial failures.

     */

    static inline void get_mems_allowed(void)

    static inline unsigned int get_mems_allowed(void)

    {

    	current->mems_allowed_change_disable++;

    	/*

    	 * ensure that reading mems_allowed and mempolicy happens after the

    	 * update of ->mems_allowed_change_disable.

    	 *

    	 * the write-side task finds ->mems_allowed_change_disable is not 0,

    	 * and knows the read-side task is reading mems_allowed or mempolicy,

    	 * so it will clear old bits lazily.

    	 */

    	smp_mb();

    	return read_seqcount_begin(&current->mems_allowed_seq);

    }

    static inline void put_mems_allowed(void)

    /*

     * If this returns false, the operation that took place after get_mems_allowed

     * may have failed. It is up to the caller to retry the operation if

     * appropriate.

     */

    static inline bool put_mems_allowed(unsigned int seq)

    {

    	/*

    	 * ensure that reading mems_allowed and mempolicy before reducing

    	 * mems_allowed_change_disable.

    	 *

    	 * the write-side task will know that the read-side task is still

    	 * reading mems_allowed or mempolicy, don't clears old bits in the

    	 * nodemask.

    	 */

    	smp_mb();

    	--ACCESS_ONCE(current->mems_allowed_change_disable);

    	return !read_seqcount_retry(&current->mems_allowed_seq, seq);

    }

    static inline void set_mems_allowed(nodemask_t nodemask)

    {

    	task_lock(current);

    	write_seqcount_begin(&current->mems_allowed_seq);

    	current->mems_allowed = nodemask;

    	write_seqcount_end(&current->mems_allowed_seq);

    	task_unlock(current);

    }

    @@ -234,12 +225,14 @@ static inline void set_mems_allowed(nodemask_t nodemask)
  
    {

    }

    static inline void get_mems_allowed(void)

    static inline unsigned int get_mems_allowed(void)

    {

    	return 0;

    }

    static inline void put_mems_allowed(void)

    static inline bool put_mems_allowed(unsigned int seq)

    {

    	return true;

    }

    #endif /* !CONFIG_CPUSETS */

include/linux/fs.h

            
                      Original file line number
                      Diff line number
                      Diff line change
                  
    @@ -523,6 +523,7 @@ enum positive_aop_returns {
  
    struct page;

    struct address_space;

    struct writeback_control;

    enum migrate_mode;

    struct iov_iter {

    	const struct iovec *iov;

    @@ -607,9 +608,12 @@ struct address_space_operations {
  
    			loff_t offset, unsigned long nr_segs);

    	int (*get_xip_mem)(struct address_space *, pgoff_t, int,

    						void **, unsigned long *);

    	/* migrate the contents of a page to the specified target */

    	/*

    	 * migrate the contents of a page to the specified target. If sync

    	 * is false, it must not block.

    	 */

    	int (*migratepage) (struct address_space *,

    			struct page *, struct page *);

    			struct page *, struct page *, enum migrate_mode);

    	int (*launder_page) (struct page *);

    	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,

    					unsigned long);

    @@ -2478,7 +2482,8 @@ extern int generic_check_addressable(unsigned, u64);
  
    #ifdef CONFIG_MIGRATION

    extern int buffer_migrate_page(struct address_space *,

    				struct page *, struct page *);

    				struct page *, struct page *,

    				enum migrate_mode);

    #else

    #define buffer_migrate_page NULL

    #endif

include/linux/init_task.h

-Original file line number
+Diff line change
@@ Expand Up / @@ -30,6 +30,13 @@ extern struct fs_struct init_fs; @@
     #define INIT_THREADGROUP_FORK_LOCK(sig)
     #endif
+    #ifdef CONFIG_CPUSETS
+    #define INIT_CPUSET_SEQ							\
+    	.mems_allowed_seq = SEQCNT_ZERO,
+    #else
+    #define INIT_CPUSET_SEQ
+    #endif
     #define INIT_SIGNALS(sig) {						\
     	.nr_threads	= 1,						\
     	.wait_chldexit	= __WAIT_QUEUE_HEAD_INITIALIZER(sig.wait_chldexit),\
@@ Expand Down Expand Up / @@ -193,6 +200,7 @@ extern struct cred init_cred; @@
     	INIT_FTRACE_GRAPH						\
     	INIT_TRACE_RECURSION						\
     	INIT_TASK_RCU_PREEMPT(tsk)					\
+    	INIT_CPUSET_SEQ							\
     }
@@ Expand Down @@

0 comments on commit `323d145`

Please sign in to comment.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Commit

There are no files selected for viewing

0 comments on commit `323d145`

Commit

There are no files selected for viewing

0 comments on commit 323d145

0 comments on commit `323d145`