Rebased the bugfix from the original Google Code issue #292 to work with Arduino 1.6.x
Description of original fix provided by Pete62:
The later 8 bit AVR's use two registers (TCCRxA, TCCRxB) whereas the ATmega8 only uses a single register (TCCR2) to house the control bits for Timer 2. Bits were inadvertently being cleared.
To avoid having a .cpp just for an extern variable definition, `static`
has been chosen over `extern`.
As the `EEPROMClass` class simply wraps functionality located elsewhere,
it is completely compiled away. Even though each translation unit which
includes the header will get a copy with internal linkage, there is no
associated overhead.
More info
[here](http://stackoverflow.com/questions/29098518/extern-variable-only-in-header-unexpectedly-working-why)
Previously, the TX pin would be set to output first and then written
high (assuming non-inverted logic). When the pin was previously
configured for input without pullup (which is normal reset state), this
results in driving the pin low for a short when initializing. This could
accidenttally be seen as a stop bit by the receiving side.
By first writing HIGH and then setting the mode to OUTPUT, the pin will
have its pullup enabled for a short while, which is harmless.
Instead of using a lookup table with (wrong) timings, this calculates
the timings in SoftwareSerial::begin. This is probably a bit slower, but
since it typically happens once, this shouldn't be a problem.
Additionally, since the lookup tables can be removed, this is also a lot
smaller, as well as supporting arbitrary CPU speeds and baudrates,
instead of the limited set that was defined before.
Furthermore, this switches to use the _delay_loop_2 function from
avr-libc instead of a handcoded delay function. The avr-libc function
only takes two instructions, as opposed to four instructions for the old
one. The compiler also inlines the avr-libc function, which makes the
timings more reliable.
The calculated timings directly rely on the instructions generated by
the compiler, since a significant amount of time is spent processing
(compared to the delays, especially at higher speeds). This means that
if the code is changed, or a different compiler is used, the
calculations might need changing (though a few cycles more or less
shouldn't cause immediate breakage).
The timings in the code have been calculated from the assembly generated
by gcc 4.8.2 and gcc 4.3.2.
The RX baudrates supported by SoftwareSerial are still not unlimited. At
16Mhz, using gcc 4.8.2, everything up to 115200 works. At 8Mhz, it works
up to 57600. Using gcc 4.3.2, it also works up to 57600 at 16Mhz and up
to 38400 at 8Mhz. Note that at these highest speeds, communication
works, but is still quite sensitive to other interrupts (like the
millis() interrupts) when bytes are sent back-to-back, so there still
are corrupted bytes in RX.
TX works up to 115200 for all combinations of compiler and clock rates.
This fixes#2019
Before, the interrupt would remain enabled during reception, which would
re-set the PCINT flag because of the level changes inside the received
byte. Because interrupts are globally disabled, this would not
immediately trigger an interrupt, but the flag would be remembered to
trigger another PCINT interrupt immediately after the first one is
processed.
Typically this was not a problem, because the second interrupt would see
the stop bit, or an idle line, and decide that the interrupt triggered
for someone else. However, at high baud rates, this could cause the
next interrupt for the real start bit to be delayed so much that the
byte got corrupted.
By clearing the interrupt mask bit for just the RX pin (as opposed to
the PCINT mask bit for the entire port), any PCINT events on other bits
can still set the PCINT flag and be processed as normal. In this case,
it's likely that there will be corruption, but that's inevitable when
(other) interrupts happen during SoftwareSerial reception.
This precalculates the mask register and value, making setRxIntMask
considerably less complicated. Right now, this is not a big deal, but
simplifying it allows using it inside the ISR next.
Since those functions are only called once now, it makes sense to inline
them. This saves a few bytes of program space, but also saves a few
cycles in the critical RX path.
Previously, up to four separate but identical ISR routines were defined,
for PCINT0, PCINT1, PCINT2 and PCINT3. Each of these would generate
their own function, with a lot of push-popping because another function
was called.
Now, the ISR_ALIASOF macro from avr-libc is used to declare just the
PCINT0 version and make all other ISRs point to that one, saving a lot
of program space, as well as some speed because of improved inlining.
On an Arduino Uno with gcc 4.3, this saves 168 bytes. With gcc 4.8, this
saves 150 bytes.
Similar to SoftwareSerial::write, this rewrites the loop to only touch
the MSB and then shift those bits up, allowing the compiler to generate
more efficient code. Unlike the write function however, it is not needed
to put all instance variables used into local variables, for some reason
the compiler already does this (and doing it manually even makes the
code bigger).
On the Arduino Uno using gcc 4.3 this saves 26 bytes. Using gcc 4.8 this
saves 30 bytes.
Note that this removes the else clause in the code, making the C code
unbalanced, which looks like it breaks timing balance. However, looking
at the code generated by the compiler, it turns out that the old code
was actually unbalanced, while the new code is properly balanced.
This change restructures the loop, to help the compiler generate shorter
code (because now only the LSB of the data byte is checked and
subsequent bytes are shifted down one by one, it can use th "skip if bit
set" instruction).
Furthermore, it puts most attributes in local variables, which causes
the compiler to put them into registers. This makes the timing-critical
part of the code smaller, making it easier to provide accurate timings.
On an Arduino uno using gcc 4.3, this saves 58 bytes. On gcc 4.8, this
saves 14 bytes.
Somehow gcc 4.8 doesn't inline this function, even though it is always
called with constant arguments and can be reduced to just a few
instructions when inlined. Adding the always_inline attribute makes gcc
inline it, saving 46 bytes on the Arduino uno.
gcc 4.3 already inlined this function, so there are no space
savings there.