Reusable iterators in JavaScript and C#

In this article we will explore the differences in behavior between the JavaScript and .NET iterators, hack the behavior of C# iterators and look at our options to make JavaScript iterators reusable.

A few days ago a user reported an issue for linq, a library that I maintain. As the name suggests, the library is a JavaScript implementation of the .NET LINQ library, and the issue was particularly interesting: at first sight it looks like a clear bug, but on closer inspection things are not that obvious.

So the reported bug is basically this:

const set = new Set([ 1, 2, 3 ])
const a = Enumerable.from(set.entries());
console.log(a.toArray());
console.log(a.toArray())

Output:

[[1, 1], [2, 2], [3, 3]]
[]

After the first call of a.toArray() it appears that the set entries are consumed, so the second call returns nothing. Let's see why.

Iterators

The code makes use of an iterator, which is the return value of set.entries(). But what are iterators anyway?

The short version is they are objects that can be used to step through collections. The long version is of course the documentation: Iterators (C#), Iteration Protocols for JS.
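To make "stepping through a collection" concrete, here is a minimal sketch that drives the iterator returned by set.entries() by hand, calling next() instead of using a loop. The variable names are mine, chosen for illustration.

```javascript
const set = new Set([1, 2, 3]);
const entries = set.entries(); // an iterator over [value, value] pairs

// Each call to next() returns an object with a value and a done flag.
const first = entries.next();
console.log(first); // { value: [1, 1], done: false }

entries.next(); // [2, 2]
entries.next(); // [3, 3]

// Once the values run out, done becomes true and stays true.
const last = entries.next();
console.log(last); // { value: undefined, done: true }
```

Note that there is no call to rewind the iterator: once done is true, the object has nothing more to give.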

They usually come hand in hand with generator functions. These are functions that preserve state between calls, and use the yield keyword to switch context back to the caller.

function* words() {
    yield "moon";
    yield "sun";
}

for (let word of words())
    console.log(word)

The return value of a generator function is actually an iterator, which allows it to be used in a for...of loop in JS, or a foreach loop in C#.

C# and JavaScript differences

Despite the similarities, it quickly becomes apparent that there is a significant difference in the way the two languages implement iterators. Take these two pieces of code:

JS:

function* words() {
    yield "moon";
    yield "sun";
}
var iter = words()

for (let word of iter)
    console.log(word)

for (let word of iter)
    console.log(word)

Output:

moon
sun

C#:

static IEnumerable<string> words()
{
    yield return "moon";
    yield return "sun";
}

var iter = words();

foreach (var item in iter)
    Console.WriteLine(item);

foreach (var item in iter)
    Console.WriteLine(item);

Output:

moon
sun
moon
sun

It seems like the JavaScript iterator is consumed after the first loop, so the second loop prints nothing.
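The JavaScript side of this can be checked directly. A for...of loop asks the object for an iterator through its Symbol.iterator method, and a generator object simply returns itself from that method. The following sketch (variable names are illustrative) shows that both passes therefore share the same, already advanced, state.

```javascript
function* words() {
    yield "moon";
    yield "sun";
}

const iter = words();

// The generator object hands out itself as the iterator,
// so there is only ever one cursor over the values.
const sameObject = iter[Symbol.iterator]() === iter;
console.log(sameObject); // true

const firstPass = [...iter];
const secondPass = [...iter];
console.log(firstPass);  // ["moon", "sun"]
console.log(secondPass); // []
```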

C# on the other hand does some wizardry behind the scenes and allows the iterator to be reset each time it is used in a new context. This is a bit puzzling, so let's try to find out what's going on behind the scenes.

This older article by Raymond Chen has a good description of the underlying mechanics. Also this article has a good intro on what the compiler does and how the logic evolved over the years, leading to async iterators.

But essentially, the C# compiler generates a new class for the iterator. This class acts like a state machine and has an internal state member and a current member that holds the most recent object being enumerated.
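As a rough mental model, here is what such a state machine could look like if we wrote it by hand in JavaScript. This is only a sketch of the idea: the class name WordsStateMachine and the fields state and current are made up for illustration and are not what the C# compiler emits.

```javascript
// Hand-written analogue of a compiler-generated iterator class (illustrative names).
class WordsStateMachine {
    constructor() {
        this.state = 0;           // which "yield" to resume after
        this.current = undefined; // most recent value produced
    }

    // Advance to the next value; returns false once the sequence is exhausted.
    moveNext() {
        switch (this.state) {
            case 0:
                this.current = "moon";
                this.state = 1;
                return true;
            case 1:
                this.current = "sun";
                this.state = 2;
                return true;
            default:
                this.state = -1; // finished: every later call lands here too
                return false;
        }
    }
}

const machine = new WordsStateMachine();
const seen = [];
while (machine.moveNext()) {
    seen.push(machine.current);
}
console.log(seen); // ["moon", "sun"]
```

Each yield in the original function becomes one case of the switch, and the state field remembers where to pick up on the next call.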

Changing the C# iterator behavior

Let's check out this compiler generated class by using ILSpy to decompile the dll of our simple console application.

Output from ILSpy:

[CompilerGenerated]
private sealed class <words>d__0 : IEnumerable<string>, IEnumerable, IEnumerator<string>, IEnumerator, IDisposable
{
    private int <>1__state;
    private string <>2__current;
    private int <>l__initialThreadId;

    ...

    private bool MoveNext()
    {
        switch (<>1__state)
        {
        default:
            return false;
        case 0:
            <>1__state = -1;
            <>2__current = "moon";
            <>1__state = 1;
            return true;
        case 1:
            <>1__state = -1;
            <>2__current = "sun";
            <>1__state = 2;
            return true;
        case 2:
            <>1__state = -1;
            return false;
        }
    }

    ...

    [DebuggerHidden]
    void IEnumerator.Reset()
    {
        throw new NotSupportedException();
    }

    [DebuggerHidden]
    IEnumerator<string> IEnumerable<string>.GetEnumerator()
    {
        if (<>1__state == -2 && <>l__initialThreadId == Environment.CurrentManagedThreadId)
        {
            <>1__state = 0;
            return this;
        }
        return new <words>d__0(0);
    }

    ...
}

Two methods stand out here: MoveNext, which is the state dispatcher, and GetEnumerator, the method called to obtain the enumerator at the start of a foreach loop. It's pretty clear why the iterator appears to reset with every foreach loop: once the original instance has been used (its state is no longer -2), GetEnumerator returns a new object:

return new <words>d__0(0);
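To see the effect of that logic in isolation, here is a small JavaScript emulation of it. It is a sketch under simplifying assumptions: the class WordsEnumerable, the WORDS array, and the collect helper are hypothetical names, the thread-id check is left out, and the state is just an index into the array.

```javascript
const WORDS = ["moon", "sun"];

// Illustrative emulation of the decompiled class (hypothetical names).
class WordsEnumerable {
    constructor(state) {
        this.state = state; // -2 means "created but not yet enumerated"
        this.current = undefined;
    }

    getEnumerator() {
        if (this.state === -2) {
            this.state = 0; // first use: reuse this very instance
            return this;
        }
        return new WordsEnumerable(0); // already used: hand out a fresh machine
    }

    moveNext() {
        if (this.state >= 0 && this.state < WORDS.length) {
            this.current = WORDS[this.state];
            this.state += 1;
            return true;
        }
        this.state = -1; // exhausted
        return false;
    }
}

// Plays the role of a foreach loop: get an enumerator, then drain it.
function collect(enumerable) {
    const enumerator = enumerable.getEnumerator();
    const out = [];
    while (enumerator.moveNext()) {
        out.push(enumerator.current);
    }
    return out;
}

const iter = new WordsEnumerable(-2);
const firstLoop = collect(iter);
const secondLoop = collect(iter);
console.log(firstLoop);  // ["moon", "sun"]
console.log(secondLoop); // ["moon", "sun"]
```

The second loop works only because getEnumerator quietly creates a second state machine; the original instance stays exhausted.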

I wonder if we can change this behavior and make it more JavaScript-like, where the iterator gets consumed after running through all the values.

Let's take a look at the IL code generated for our example. We can do this by disassembling the program with ildasm.exe. From a Visual Studio Developer Command Prompt:

ildasm /out:ConsoleTest.il /source /nobar ConsoleTest.dll

Let's find the place where GetEnumerator returns a new object. It appears to be here:

  IL_0022:  ldc.i4.0
  IL_0023:  newobj     instance void Program/'<words>d__0'::.ctor(int32)
  IL_0028:  stloc.0
  IL_0029:  ldloc.0
  IL_002a:  ret
} // end of method '<words>d__0'::'System.Collections.Generic.IEnumerable<System.String>.GetEnumerator'

Let's change it so that it returns this instead of the new object:

  //IL_0022:  ldc.i4.0
  //IL_0023:  newobj     instance void Program/'<words>d__0'::.ctor(int32)
  IL_0022:  ldarg.0
  IL_0028:  stloc.0
  IL_0029:  ldloc.0
  IL_002a:  ret
} // end of method '<words>d__0'::'System.Collections.Generic.IEnumerable<System.String>.GetEnumerator'

Here we've commented out the lines responsible for creating the new object. Instead, we load the this pointer onto the stack with ldarg.0 (inside an instance method, argument 0 holds the this pointer). Next, stloc.0 pops that value off the stack and stores it in local variable 0. The last two lines of the method remain unchanged: they load the local variable back onto the stack and return it.

Let's see if this works. We need to reassemble our program. From the same Visual Studio Developer Command Prompt:

ilasm /dll /out=ConsoleTest.dll ConsoleTest.il

Running ConsoleTest.exe now will output:

moon
sun

Voilà, the words are now displayed only once. Just like in JavaScript, the second time we run through our iterator it is already "consumed".

Now we could probably take this a step further and have the whole process of disassembly, code change and reassembly done automatically by a PowerShell script, but is this something you would do in production? Probably not. Still, it was a nice hack.

Reusable iterators in JavaScript

What about JavaScript? Is there a way to have reusable iterators like in C#?

People have written various functions to clone the original iterator, so that when it runs out of values you can use the clone. But this is problematic because it relies on buffering: as the values come in from the iterator they are stored in a buffer, and later iterations are fed from that buffer. Not ideal if you have to work with 1M records.
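To illustrate the buffering approach and its cost, here is a minimal sketch of such a helper. The function name replayable and its bufferedCount method are my own inventions for this example, not taken from any particular library.

```javascript
// Hypothetical helper: wraps a one-shot iterator in a re-iterable object
// by remembering every value the iterator has produced so far.
function replayable(iterator) {
    const buffer = [];
    let exhausted = false;

    return {
        *[Symbol.iterator]() {
            let index = 0;
            while (true) {
                if (index < buffer.length) {
                    yield buffer[index++]; // replay a value we already hold
                } else if (exhausted) {
                    return;
                } else {
                    const step = iterator.next(); // pull a new value from the source
                    if (step.done) {
                        exhausted = true;
                        return;
                    }
                    buffer.push(step.value); // yielded from the buffer on the next turn
                }
            }
        },
        bufferedCount: () => buffer.length,
    };
}

const set = new Set([1, 2, 3]);
const entries = replayable(set.entries());

const firstPass = [...entries];
const secondPass = [...entries];
console.log(firstPass);               // [[1, 1], [2, 2], [3, 3]]
console.log(secondPass);              // [[1, 1], [2, 2], [3, 3]]
console.log(entries.bufferedCount()); // 3
```

The second pass works, but only because every single value is still held in memory, which is exactly the problem with large inputs.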

So you can't really clone or reset an iterator in JavaScript. To iterate over the values again, you can simply call the generator function again. But in the case of our library linq, this was not an option, since the iterator object has already been created by the time it is passed in as a parameter.
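When you do control the generator function, this is straightforward. The sketch below shows two variations: calling the function again for each pass, and wrapping it in an object whose Symbol.iterator method creates a fresh generator every time. The name reusableWords is illustrative.

```javascript
function* words() {
    yield "moon";
    yield "sun";
}

// Each call to words() produces an independent iterator.
const firstPass = [...words()];
const secondPass = [...words()];
console.log(firstPass);  // ["moon", "sun"]
console.log(secondPass); // ["moon", "sun"]

// An object whose Symbol.iterator is a generator method can be looped over
// any number of times, because every loop starts a new generator.
const reusableWords = {
    *[Symbol.iterator]() {
        yield* words();
    },
};

const thirdPass = [...reusableWords];
const fourthPass = [...reusableWords];
console.log(thirdPass);  // ["moon", "sun"]
console.log(fourthPass); // ["moon", "sun"]
```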

Our solution was to allow a different way to initialize the Enumerable, using a lambda instead of passing in the direct object:

const set = new Set([ 1, 2, 3 ])
const a = Enumerable.from(() => set.entries());
console.log(a.toArray());
console.log(a.toArray())

With this solution, we can store the anonymous function and, whenever a new iteration is required, call it to get a new iterator.
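The idea can be sketched in a few lines. To be clear, this is not the actual linq source: LazySequence is a hypothetical stand-in for Enumerable, reduced to the bare minimum needed to show how storing a factory function makes repeated iteration possible.

```javascript
// Hypothetical minimal wrapper, not the linq library's real implementation.
class LazySequence {
    constructor(source) {
        // Accept either a factory function or an iterable passed in directly.
        this.getIterable = typeof source === "function" ? source : () => source;
    }

    // Every iteration asks the stored function for its iterable again.
    [Symbol.iterator]() {
        return this.getIterable()[Symbol.iterator]();
    }

    toArray() {
        return Array.from(this);
    }
}

const set = new Set([1, 2, 3]);

// Passing the iterator directly: it is consumed by the first toArray().
const oneShot = new LazySequence(set.entries());
const oneShotFirst = oneShot.toArray();
const oneShotSecond = oneShot.toArray();
console.log(oneShotFirst);  // [[1, 1], [2, 2], [3, 3]]
console.log(oneShotSecond); // []

// Passing a lambda: each toArray() gets a fresh iterator.
const reusable = new LazySequence(() => set.entries());
const reusableFirst = reusable.toArray();
const reusableSecond = reusable.toArray();
console.log(reusableFirst);  // [[1, 1], [2, 2], [3, 3]]
console.log(reusableSecond); // [[1, 1], [2, 2], [3, 3]]
```

Unlike the buffering approach, nothing is cached here; the cost is that the lambda runs once per iteration, so it should be cheap and free of side effects.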

This was a fun topic to explore and there's certainly more to it, especially when talking about async iterators, but maybe that's something to discuss in a future article.

2022-11-20
