Sending First Patch To Upstream

Last week, I sent my first patch upstream and it was accepted into the mm tree :). I am very happy about that. You can see it here.

Subsystem maintainers are very careful, and a lot of people review the code. Outside of the staging directory, and before the internship, my first patch was for the y2038 project. When I sent that patch, a lot of developers reviewed it and made suggestions. I recently saw the patch here; it was my first experience :).

These days I work with the Linux kernel mm community, which makes me very excited and happy :). I gained some experience in this process and learnt how to be sure about my changes, which is the most important thing when coding. Every time I talked with my mentor I reported something different :) and said "oh, this prevents collapsing pages into a THP!", because I was testing incorrectly. Finally, I found what the problem was.

For test results I look at /var/log/kern.log. It is a very large file, so I split it with "split -n 5" and look at the newly created smaller files; the log timestamps are important. To be sure about my changes, I print out the virtual memory areas of my test programs.

/proc/pid/smaps shows all the VMAs of a process. pr_info("vm_start = %04lx\n", vma->vm_start); is enough to see the beginning address of the VMA. Sometimes I need to see which process ran the function, so I print out current->pid. If I know what happened at every step, I find my mistakes very easily. I have to do something like that because other processes also log about their huge pages in kern.log, and I shouldn't confuse which process logged the results. Examining kern.log was a big task for me.
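To match a vm_start value seen in kern.log against the VMAs listed in /proc/pid/smaps, the header line of each VMA has to be parsed. The following is a minimal userspace sketch of that step, assuming the usual "start-end perms ..." header format; parse_vma_range and vma_contains are hypothetical helper names made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Parse the "start-end perms" prefix of a /proc/<pid>/smaps header line.
 * Returns 1 on success, 0 if the line is not a VMA header. */
int parse_vma_range(const char *line, unsigned long long *start,
                    unsigned long long *end)
{
    unsigned long long s, e;
    char perms[5];

    assert(line && start && end);
    if (sscanf(line, "%llx-%llx %4s", &s, &e, perms) != 3)
        return 0;
    if (e <= s)
        return 0;
    *start = s;
    *end = e;
    return 1;
}

/* Does addr (e.g. a vm_start value copied from kern.log) fall in this VMA? */
int vma_contains(unsigned long long start, unsigned long long end,
                 unsigned long long addr)
{
    return addr >= start && addr < end;
}
```

Field lines such as "AnonHugePages: 2048 kB" are rejected by the parser, so it can be run over every line of the file and will only report the VMA headers.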

Before sending a patch I need to be careful and check a few things. Keeping focus on the issue is also important. Working on the kernel requires more attention than the other projects I gained experience with as a student.

After this patch, I will work on zero pages and discover new things :).

29 January 2015

Posted In: Gezegen, internship, kernel, linux, memory management, opw

Happy Coding & Testing Process

After the third week, we started coding and made some basic changes. Our first aim is to enable read-only ptes for collapsing. I am still looking into this issue. Something goes wrong and I can't see what it is.

To test my changes I've prepared test programs. They create memory pressure and push the system to swap pages out. Actually, they are very basic: they just call malloc() and do read/write operations on the memory. memtest and stress are very strong workload programs, but to get something swapped out they mix in operations that are not right for my case. I need to test specific conditions, so I use my own test programs, and they will become more sophisticated later on.
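A minimal sketch of what such a basic test program can do, assuming nothing more than malloc() plus one write and one read per page. It is an illustration and not the actual test code; touch_pages is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate `bytes` of anonymous memory, write one byte into every page so
 * the kernel really has to back it, then read the bytes back.
 * Returns the number of pages verified, or 0 on allocation failure. */
size_t touch_pages(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t page_size = page > 0 ? (size_t)page : 4096;
    size_t verified = 0;
    /* volatile keeps the compiler from optimizing the accesses away */
    volatile unsigned char *buf = malloc(bytes);

    if (!buf)
        return 0;

    for (size_t off = 0; off < bytes; off += page_size)
        buf[off] = (unsigned char)(off / page_size);      /* write: faults the page in */

    for (size_t off = 0; off < bytes; off += page_size)
        if (buf[off] == (unsigned char)(off / page_size)) /* read it back */
            verified++;

    free((void *)buf);
    return verified;
}
```

Calling it with a size larger than the free RAM (or from several processes at once) is what would actually create pressure; the small sizes are only useful to check that the program itself works.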

To get information about what happened with my changes, I use smem, which with the -t -p options shows swap usage as a percentage along with pid and ppid, and that's enough for me :). For a specific process I look at /proc/pid/smaps, which gives the AnonHugePages and Anonymous numbers and the swap usage.
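Reading those numbers by eye gets tiring when a process has many VMAs. Below is a small sketch that pulls one field out of smaps-style lines and sums it over all VMAs, assuming the "Key:   <n> kB" line format; smaps_field_kb and sum_smaps_field are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* If `line` is "<key>:   <n> kB", store n in *kb and return 1; else return 0. */
int smaps_field_kb(const char *line, const char *key, long *kb)
{
    size_t len = strlen(key);
    long value;

    if (strncmp(line, key, len) != 0 || line[len] != ':')
        return 0;
    if (sscanf(line + len + 1, "%ld kB", &value) != 1)
        return 0;
    *kb = value;
    return 1;
}

/* Sum one field (e.g. "AnonHugePages" or "Swap") over every VMA in the file. */
long sum_smaps_field(FILE *f, const char *key)
{
    char line[256];
    long total = 0, kb;

    while (fgets(line, sizeof line, f))
        if (smaps_field_kb(line, key, &kb))
            total += kb;
    return total;
}
```

With a FILE opened on /proc/<pid>/smaps, sum_smaps_field(f, "AnonHugePages") gives the total amount of memory currently backed by transparent huge pages for that process.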

To look at kernel messages we can use dmesg, but its buffer is not big enough for me :) because I've been testing almost every line of the functions. To increase the size of the dmesg log, you should set CONFIG_LOG_BUF_SHIFT in the kernel config file. However, when I set it to 27, which means 2^27 bytes, make does not seem to accept this number! Probably the number has to be 16 or 17, as the config help suggests, but I'm not sure about that. Then I looked at /var/log/kern.log, and its size is enough for me :). I'm sure about my test results with it.
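The shift value is an exponent, so the sizes grow quickly. A tiny sketch of the arithmetic (log_buf_bytes is a hypothetical helper, not a kernel function):

```c
#include <assert.h>

/* Buffer size in bytes for a CONFIG_LOG_BUF_SHIFT-style exponent:
 * the buffer holds 2^shift bytes. */
unsigned long long log_buf_bytes(unsigned int shift)
{
    assert(shift < 64);
    return 1ULL << shift;
}
```

So 16 gives 65536 bytes (64 KB), 17 gives 131072 bytes (128 KB), and 27 would be 134217728 bytes (128 MB).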

do_swap_page() performs the swap-in operations; afterwards khugepaged scans the page tables that do_swap_page() filled in. There is no direct function call to khugepaged_scan_mm_slot(); it runs every 10000 milliseconds. Its call chain looks like this:
khugepaged_scan_mm_slot() -> khugepaged_scan_pmd() -> collapse_huge_page() ->

Today I realized that in khugepaged_scan_pmd() the ptes seem not to be present, even though they have been swapped in! So I need to look at do_swap_page() again :).

16 January 2015

Posted In: coding, Gezegen, internship, kernel, khugepaged, linux, smem, stress, testing

Beginning to Read Memory Management Code

My Linux Kernel internship started two weeks ago and will last 3 months. My mentor is Rik van Riel. My project aims to fix transparent huge page swapping issues: if the system needs to swap out huge pages, it has to split them into small pages, but then it cannot reconstitute the huge pages.

Rik asked me some questions about huge pages and swapping, and I've answered them. Before answering, I looked at the following data structures and definitions.

First I examined struct page and mm_struct in mm_types.h. The kernel holds all the memory information of a process in mm_struct; it includes vm_area_struct *mmap, which is the list of memory areas. vm_area_struct is an object that describes one memory area. Also, kernel threads don't use an mm_struct, so if you see if (!mm) {...} it means this is a kernel thread.

Likely & Unlikely Macros: These are branch prediction hints used for compiler optimization. They tell the compiler which branch is expected to run, so it can arrange the code for the common path.
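A userspace sketch of the same idea, assuming a GCC or Clang compiler (both provide __builtin_expect). The sum_array function is a made-up example to show where the hint goes.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalents of the kernel macros (GCC/Clang builtin). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sum an array; the NULL check is the rare path, so hint it as unlikely. */
long sum_array(const int *values, size_t count)
{
    long total = 0;

    if (unlikely(values == NULL))
        return -1;

    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

The hint never changes the result of the condition; it only influences how the compiler lays out the two branches.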

NUMA & UMA Systems: I understood the two terms by looking at the picture :).

Hot & Cold Page: If a page is in the CPU cache it is a hot page; a cold page is one that is not.

struct scan_control: It is used for page scanning and holds the following variables:
unsigned long nr_scanned: How many inactive pages were scanned.
int may_writepage: Determines whether the kernel may write pages out to backing store.
swap_cluster_max: A limit applied to an LRU list.

struct zone: The kernel divides memory into nodes, and each node has a list of zones. Every zone contains pages. You can look at struct zone for details.

struct list_head: A doubly linked list, used to build and manage lists.
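The trick of list_head is that the list node is embedded inside the object, and the object is recovered from the node pointer. Below is a simplified userspace sketch of that pattern, not the kernel's <linux/list.h>; struct demo_page and sum_ids are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points to itself in both directions. */
void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert `item` just before `head`, i.e. at the tail of the list. */
void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Unlink `item` from whatever list it is on. */
void list_del(struct list_head *item)
{
    item->prev->next = item->next;
    item->next->prev = item->prev;
    item->next = item->prev = item;
}

/* Recover the enclosing struct from a pointer to its embedded list_head. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical payload type, only for this example. */
struct demo_page {
    int id;
    struct list_head lru;
};

/* Sum the ids of every demo_page on the list. */
int sum_ids(struct list_head *head)
{
    int total = 0;

    for (struct list_head *pos = head->next; pos != head; pos = pos->next)
        total += list_entry(pos, struct demo_page, lru)->id;
    return total;
}
```

Because the list is circular, the head itself marks both the start and the end of a traversal, and no NULL checks are needed.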

Page Flags:

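High Memory: On 32-bit systems, high memory is the part of physical memory that is not permanently mapped into the kernel's address space. The kernel maps it temporarily when it needs it, and it is mostly used for user-space pages.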

Page Vector: Provides operations on page list instead of individual pages.

Slot: The swap area is divided into slots. The size of each slot equals the size of a page frame.

up_read/write, down_read/write functions: They take and release read-write semaphores (for example mmap_sem), and their implementation includes assembly instructions.

BUG_ON* macros: They check the given condition and either halt the system or do nothing.

Swap Cache: After a page is swapped out, if the page has not changed it keeps an entry in the swap cache, so the system can read the data from memory without reading the page back from the swap area.

Transparent Huge Page vs. Huge Page: Transparent huge pages provide a layer on top of huge pages.

Note-1: Swap space is set up by user-space tools (mkswap).

Note-2: x86 systems don't use the pte level for THP (transparent huge pages); the data can be accessed directly from the pmd.

The following are the questions my mentor asked me. I've explained just the points that are important for my project, together with their function traces, because there are a lot of functions and they can sometimes be very complex :).

The call chains below are for Linux Kernel 3.18.

1) from do_page_fault(), sometimes the VM uses transparent huge pages
   (2MB size on x86) for anonymous memory. What functions does the
   code go through between do_page_fault() and the function that
   installs 2MB pages in the process page tables?
When I examined the functions, I saw a lot of spinlock functions, and Rik said they ensure that multiple concurrent instances of the page fault code do not manipulate the page table simultaneously.

  __do_page_fault() /* checks the fault is belong to bad area or good area */
pgtable_trans_huge_withdraw takes a page table page from the process's
reserve of page table pages, so the 2MB page (mapped at pmd level) can
be mapped as 4kB page entries (at the pte level).

2) When are 2MB pages used?

If PAE is enabled, 2MB pages are used. I looked it up at the following links:

3) What does the VM do when a 2MB page cannot be allocated?
   (still in memory.c and huge_memory.c)
In do_huge_pmd_anonymous_page(), if it cannot allocate a 2MB page,
it returns out of memory or falls back. It also calls count_vm_event()
with the THP_FAULT_FALLBACK argument. At line 824, it tries to set the
huge zero page; if it can't do that, it calls put_huge_zero_page(),
which calls atomic_dec_and_test().

At line 839, if it couldn't install the huge page, it calls
put_page(). My thought: put_page() checks whether the page is
compound or not, but the page will always be compound,
because it comes from alloc_hugepage_vma().

4) When the system runs low on memory and wants to swap something
   out, it will split up a huge page before assigning it space in
   a swap area. Find the code in vmscan.c, swapfile.c and huge_memory.c
   that does that. What does the function trace look like from
   try_to_free_pages to the function that splits the huge pages?
  throttle_direct_reclaim(gfp_mask, zonelist, nodemask)
    do_try_to_free_pages(zonelist, &sc)
      do while { vmpressure_prio()
      shrink_zones() /* if a zone reclaimable it returns true */}

I've separated shrink_zones() as below:

  populated_zone() {return (!!zone->present_pages);}
  zone_reclaimable_pages(zone) -> get_nr_swap_pages()

try_to_free_pages(): If memory is not sufficient, it checks pages and removes the least used ones.
shrink_zones(): It is run by kswapd at a specified time interval and is used to remove rarely used
pages. It also balances the inactive and active lists using shrink_active_list().
shrink_active_list(): Moves pages between active_list and inactive_list, detects the least used active pages, and also implements page selection.
shrink_inactive_list(): Removes pages from inactive_list and sends them to shrink_page_list().

In general, shrink_* functions run per zone.

5) in huge_memory.c look at collapse_huge_page and the functions
   that call it - under what conditions does the kernel gather up
   512 4kB pages and collapse them into one 2MB page?
                khugepaged_alloc_page() /* allocate new page */
                __collapse_huge_page_isolate(vma, address, pte); /* this one is new function for me */
                if (isolate_lru_page(page)) { ... }
                if (pte_young(pteval) || PageReferenced(page) ||
                        mmu_notifier_test_young(vma->vm_mm, address)) { ... }

__collapse_huge_page_isolate() removes pages from the LRU with isolate_lru_page().
My thought: when pages are collapsed, their LRU state changes. So it isolates

Note-1: __collapse_huge_page_copy(): 
The 4kB pages could be anywhere in memory.
The 2MB page needs to be one contiguous page.
That means the contents of the 4kB pages need
to be copied over into the one 2MB page.
In khugepaged_scan_pmd(), if a page is young, it calls collapse_huge_page().
If the collapse function can validate the vma and pmd and isolate the pages, it collapses

6) under what conditions does the kernel decide not to collapse
   the 4kB pages in a 2MB area into a 2MB page?
There are some conditions for it:
1) If it can't allocate a huge page, it won't collapse.
2) I looked at this condition in collapse_huge_page():
        if (unlikely(khugepaged_test_exit(mm))) {goto out;}
   If the mm has no users, it goes to the label out and doesn't collapse the pages.
3) If it can't find the vma and pmd
4) If it can't isolate the pages

7)  look at what happens when shrink_page_list()
passes a 2MB transparent huge page to add_to_swap()
When a 2MB page is sent to add_to_swap(), it first checks whether the page is locked and up to date, then calls get_swap_page(). If there is no swap page, it returns 0. Otherwise it checks PageTransHuge() and then calls split_huge_page_to_list(). split_huge_page_to_list() gets the anonymous vma, takes the write lock on it, and checks PageCompound, which tells it whether the page is huge or not. Then it checks PageSwapBacked. Then it calls __split_huge_page(), which requires that the page is not a tail page, and splits the page in __split_huge_page_splitting(). Control then returns to add_to_swap(), which handles swapcache_free().

8) Can you explain what the page looks like after it has been split?
What happened to the 2MB page?  What do we have instead?
What happened with the PageCompound flag?
__split_huge_page() calls __split_huge_page_splitting() in a loop. It counts the number of mapped pmds before splitting and increases mapcount.

In split_huge_page_map(), it takes the page offset from the address argument. First, it checks the validity of the pmd address. It
creates the small entries in a for loop with mk_pte(), set_pte_at() and pte_unmap() (this one is just a nop on x86 systems). The for loop creates one entry for page one, then page two, then page three, etc. It advances the entry address by the page size (haddr += PAGE_SIZE) for each of the entries under the pmd.
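The address arithmetic of that loop can be sketched in userspace. This assumes the x86 sizes (4kB base pages, 2MB huge pages, so 512 entries per pmd); the DEMO_ constants and the two functions are made up for this example and only model the addresses, not the page table writes.

```c
#include <assert.h>
#include <stdint.h>

/* x86 values assumed for this sketch: 4kB base pages, 2MB huge pages. */
#define DEMO_PAGE_SIZE       4096ULL
#define DEMO_HPAGE_PMD_SIZE  (2ULL * 1024 * 1024)
#define DEMO_HPAGE_PMD_NR    (DEMO_HPAGE_PMD_SIZE / DEMO_PAGE_SIZE)   /* 512 */
#define DEMO_HPAGE_PMD_MASK  (~(DEMO_HPAGE_PMD_SIZE - 1))

/* Round an address down to the start of its 2MB-aligned region. */
uint64_t huge_base(uint64_t address)
{
    return address & DEMO_HPAGE_PMD_MASK;
}

/* Fill out[] with the address each small entry would map after a split;
 * returns the number of entries written. */
unsigned int split_addresses(uint64_t address, uint64_t out[DEMO_HPAGE_PMD_NR])
{
    uint64_t haddr = huge_base(address);
    unsigned int i;

    for (i = 0; i < DEMO_HPAGE_PMD_NR; i++, haddr += DEMO_PAGE_SIZE)
        out[i] = haddr;
    return i;
}
```

For an address of 0x20012345 the region starts at 0x20000000, the second entry is at 0x20001000, and the last (512th) entry is at 0x201ff000.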

I asked why pmd_populate() is performed twice, at lines 1790 and 1843.
Rik's answer: The first pmd_populate puts in place a special transparent huge page
PMD that says "this transparent hugepage is being split, do not mess
with it".

The second pmd_populate puts in place the page table page containing
the 4kB pages that map the 2MB area.

Note-1: __split_huge_page() iterates over vmas from the root of a red-black tree at line 1864, but the function gets only one page, and a page can match just one vma. So why does it need to iterate over vmas?

Rik replied to my question: "The same page may be shared by multiple processes, if the
process that created the 2MB page originally called fork() afterwards."

Note-2: __split_huge_page_splitting() calls pmdp_splitting_flush(). What does it do, and what about pmd_update and flush_tlb_range? I think it should save the pmd's content before splitting; it shouldn't lose it. Why does it flush the pmd?
Rik's answer: if a small page is touched or dirtied afterwards, we want the MMU to set the accessed and/or dirty bit on the 4kB page entry.

Note-3: We can ignore PVOP_VCALL stuff - that is for Xen, which uses an alternate function for filling in page table info. 

9) Under what conditions can a region with 4kB pages be
turned into a 2MB transparent huge page?
I've traced following call chain:
                        __handle_mm_fault() /* check conditions */
                                do_huge_pmd_anonymous_page() /* check conditions */
                                __do_huge_pmd_anonymous_page() /* check conditions */

In __handle_mm_fault(), if the expression "if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) { ... }" is true, it can call do_huge_pmd_anonymous_page(). I've seen this description of pmd_none(): "if page is not in RAM, returns true." But I think that if a page is not used by any process, it contains zeros and should be in RAM.

In do_huge_pmd_anonymous_page(), "if (!(flags & FAULT_FLAG_WRITE) && transparent_hugepage_use_zero_page()) { ... }":
if the condition holds, it can start to create a transparent huge page. I looked at the values in the condition.
flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; are the allow-retry and killable flags defined with special values (0x08, 0x02)?
I think their values are only there to check something. And transparent_hugepage_flags is 0UL; does it always have this value? I looked at its value;
it probably always has the same value. The last condition creates the huge zero page using set_huge_zero_page(), which calls pmd_mkhuge().
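The reason such flags get values like 0x01 or 0x08 is that each one occupies a single distinct bit, so several can be combined with | and tested with &. A small sketch of the condition above, with illustrative values: the DEMO_ constants are assumptions for this example and are not copied from the kernel headers, and can_use_huge_zero_page is a hypothetical function.

```c
#include <assert.h>

/* Illustrative flag values: each is a distinct bit. They are assumptions
 * for this sketch, not copied from the kernel headers. */
#define DEMO_FLAG_WRITE        0x01u
#define DEMO_FLAG_ALLOW_RETRY  0x08u
#define DEMO_FLAG_KILLABLE     0x20u

/* A read fault (no WRITE bit) may be served by the shared huge zero page,
 * provided that feature is enabled. */
int can_use_huge_zero_page(unsigned int flags, int zero_page_enabled)
{
    return !(flags & DEMO_FLAG_WRITE) && zero_page_enabled;
}
```

A fault with flags = DEMO_FLAG_ALLOW_RETRY | DEMO_FLAG_KILLABLE has no write bit, so it passes the check; adding DEMO_FLAG_WRITE makes it fail regardless of the other bits.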

One more condition: if __do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page) returns false, it means the transparent huge page was created, at line: huge_memory.c#L808
If pmd_none() returns true, it creates the THP.

10) What code turns 4kB pages into a 2MB page?
pmd_mkhuge() installs the 2MB page. Actually, it does pmd_set_flags(pmd, _PAGE_PSE).
_PAGE_PSE uses _PAGE_BIT_PSE, which means huge page.
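What "setting the PSE flag" means can be modelled in a few lines. This is a userspace sketch, assuming bit 7 is the x86 page-size bit; the demo_ names are hypothetical stand-ins and not the kernel's own types or functions.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 7 is assumed to be the x86 "page size" (PSE) bit. Names carry a
 * DEMO_ prefix because this is a userspace model, not kernel code. */
#define DEMO_PAGE_BIT_PSE 7
#define DEMO_PAGE_PSE     (UINT64_C(1) << DEMO_PAGE_BIT_PSE)

typedef struct { uint64_t val; } demo_pmd_t;

/* Model of pmd_mkhuge(): return the entry with the PSE bit set. */
demo_pmd_t demo_pmd_mkhuge(demo_pmd_t pmd)
{
    pmd.val |= DEMO_PAGE_PSE;
    return pmd;
}

/* Does this entry describe a huge page? */
int demo_pmd_is_huge(demo_pmd_t pmd)
{
    return (pmd.val & DEMO_PAGE_PSE) != 0;
}
```

The rest of the entry (the physical address and the other flag bits) is left untouched; only the one bit changes how the MMU interprets the pmd.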

11) Under what conditions can a region with 4kB pages not
be turned into a 2MB transparent huge page?
There are a lot of conditions for this.
1) If the entire 2MB page is not inside the vma, return fall back.
2) If it can't create the anonymous vma, return out of memory.
3) If it can't allocate a huge page for the vma, return fall back.
4) If it gets true from __do_huge_pmd_anonymous_page(), return fall back.
5) In __do_huge_pmd_anonymous_page(), if the page is not a compound (huge) page, the code triggers a kernel bug and halts.
   VM_BUG_ON_PAGE(!PageCompound(page), page);

6)  If it cannot allocate a huge page
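The bounds check behind the first condition can be sketched as follows, assuming the rule is that the 2MB-aligned region around the faulting address must lie entirely inside the vma. huge_region_fits and the DEMO_ constants are hypothetical names for this userspace example.

```c
#include <assert.h>
#include <stdint.h>

/* 2MB huge page size assumed (x86). */
#define DEMO_THP_SIZE (UINT64_C(2) * 1024 * 1024)
#define DEMO_THP_MASK (~(DEMO_THP_SIZE - 1))

/* Returns 1 if the 2MB-aligned region around `address` lies entirely
 * inside [vm_start, vm_end), 0 if the fault has to fall back to 4kB pages. */
int huge_region_fits(uint64_t address, uint64_t vm_start, uint64_t vm_end)
{
    uint64_t haddr = address & DEMO_THP_MASK;

    if (haddr < vm_start || haddr + DEMO_THP_SIZE > vm_end)
        return 0;
    return 1;
}
```

So a vma that is smaller than 2MB, or that does not contain a full aligned 2MB range around the address, can never get a huge page there.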

24 December 2014

Posted In: Gezegen, huge page, internship, kernel, linux, memory management, opw, swap, thp

Linux Kernel Internship

Last month I applied to the Gnome Outreach Program for Women and sent patches to the Linux Kernel. I applied because I wanted to learn low-level things. I also really like computer design and architecture topics. Actually, I don't have enough knowledge about them, but I like learning them.

The Linux Kernel community accepted me as an intern. I'll work on the khugepaged swap readahead project. Working with the Linux Kernel team will be a great experience for me. They really want to help kernel newbies :).

Actually, working on the Linux Kernel needs a lot of reading. Just to write a few lines of code, you need to read a chapter of a book and a few blog posts on the topic, and ask the developers some questions :).

Nowadays, I've started to read about memory management topics like the TLB, huge pages, and page faults in an operating systems book, and I'm also examining the do_page_fault() function.

26 November 2014

Posted In: Gezegen, Gnome, internship, kernel, linux, opw
