Half SQL: semi-opaque tables reduce developer effort

Your relational data design could steal a trick from NoSQL: the database representation of your application objects doesn’t need to be broken down into primitive values. Some opacity in relational data is beneficial.

Relational databases impose rigid structure on the data: data is organized in tables, which are made up of columns. Traditionally the columns could only be of primitive types (numeric or alpha), although opaque types (BLOB, CLOB) and XML are now universally available. The rigid structure makes it easy to reason about data, to guarantee data quality in the database layer and to build tooling on top of the database. Querying based on values of individual columns is well defined and highly efficient. Downstream usage (reporting, analysis) becomes easier. Relational structure works well when you need referential integrity across entities that are not updated together.

But relational databases have downsides. One of them is the notorious difficulty of changing the database structure. Martin Fowler wrote: “A lot of effort in application development is tied up in working with relational databases. Although Object/Relational Mapping frameworks have eased the load, the database is still a significant source of developer hours”. Guy Harrison, blogging for Tech Republic: “Change management is a big headache for large production RDBMS. Even minor changes to the data model of an RDBMS have to be carefully managed and may necessitate downtime or reduced service levels”. There is an “impedance mismatch” between the RDBMS and the application layer: “Application developers have been frustrated with the impedance mismatch between the relational data structures and the in-memory data structures of the application”. Even a change confined to a single table (e.g. adding a column) requires significant effort and a synchronized rollout of the database and application layers.

The frustration with the amount of developer effort that relational databases require was one of the drivers behind the rise of NoSQL starting about a decade ago. NoSQL databases differ from relational databases in many other ways, and switching to NoSQL is not an easy step. Fortunately, you can solve some of your problems by using just one item from the NoSQL bag of tricks. You can greatly reduce the impact of single-table changes (such as adding a new column, the most frequent type of change) by making your table definition semi-opaque. Don’t throw away your old and trusty RDBMS. Turn it into a Half SQL data store: in each table, select a small number of key columns that may be used in indexes and keep them. Hide all other fields from the RDBMS by placing them into an opaque container field of a BLOB type. As a simplified example, an Orders table may look like this:

[Table: Orders — a few indexed key columns plus a single opaque BLOB column holding everything else]

Your application will be responsible for serializing and deserializing those blobs. Adding a new object field will be invisible to the RDBMS. When you need to add a new field, you will only need to change the code in the serializer/deserializer. And if you use a good serialization library (if your application is written in Java, please don’t use built-in serialization; there are many libraries that are faster and more flexible), in most cases even those changes will be a no-op because your library will take care of them automatically. No data migration will be needed. You will be able to write tests to verify that your logic works before and after the change. And you retain the RDBMS goodness of referential integrity and super-fast queries over the indexed columns.

Stashing all “other” object fields into a BLOB column could save you quite a bit of effort.
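To make the idea concrete, here is a minimal sketch of the application-side serializer. The class and field names are hypothetical, and a real application would more likely use a serialization library (Protocol Buffers, Kryo, etc.) than this hand-rolled key-value encoding — but even this toy version shows why adding a field needs no schema change:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Sketch: non-indexed fields travel as a key-value blob. Readers iterate
// over whatever pairs the blob contains, so new fields need no ALTER TABLE.
public class OrderBlob {

    // Serialize the "other" fields into bytes destined for the BLOB column.
    static byte[] toBlob(Map<String, String> fields) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(fields.size());
            for (Map.Entry<String, String> e : fields.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Deserialize the blob back into a field map.
    static Map<String, String> fromBlob(byte[] blob) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob));
            Map<String, String> fields = new HashMap<>();
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                fields.put(in.readUTF(), in.readUTF());
            }
            return fields;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> other = new HashMap<>();
        other.put("shippingAddress", "10 Main St");
        other.put("giftMessage", "Happy birthday!"); // field added in a later release
        byte[] blob = toBlob(other);
        // Only the key columns (e.g. order_id, customer_id) remain real
        // columns; everything in 'blob' goes into one BLOB column.
        System.out.println(fromBlob(blob).get("giftMessage")); // Happy birthday!
    }
}
```

Because the database never sees the individual fields, the rollout of a new field is confined entirely to the application layer.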

Machine Learning and Artificial Intelligence

There is a popular line of thinking about Artificial Intelligence (AI) that says that Machine Learning is the pathway to AI. This idea gained momentum partially because Machine Learning is a hot topic (and the buzzword of the decade), but also because we understand ML fairly well. With ML in hand, it is easier to make machines that can learn and become smart, while making a machine smart out of the box is a much more difficult task.

This approach makes sense. After all, the only intelligence that we know – human – is learned, not built. In other words, even if we were to invent Asimov-style positronic brains, robots would still need a period of learning after being fully assembled. Of course, robots will need to learn a whole lot faster than humans – nobody wants a robot that takes 18 years to become functional. Fortunately, robots can learn much faster than humans. But to learn fast, robots need a lot of input to process. It means that they will have to learn from shared information (as opposed to personal experience alone). To give a hypothetical example, a robot would become a good driver much faster if it could learn not just from its own driving history, like all humans do, but from the experience of all the other robot drivers. Robots will need to remain in constant communication with each other (whether centralized or decentralized) in order to keep up with all the changes in the world around them and get better at whatever they’re doing.

Now, how will ML help us achieve AI? By itself, ML is just machine code capable of syntactic manipulation based on certain statistical rules, which is not intelligence of any kind, not even a weak AI. Will it lead a computer system to developing intelligence? The Chinese Room argument says it’s impossible to jump from syntactic rules to semantic understanding, so there is still a big phase transition to be made before AI becomes a reality. Patience, please.

Wer wartet mit Besonnenheit

Der wird belohnt zur rechten Zeit

(Roughly: Whoever waits patiently, will be rewarded at the right time. © Rammstein).

Disclaimer: opinions expressed in this post are strictly my own – not of my employer.

Please NO: Apple patents camera-blocking technology.

Apple patented technology that would allow blocking the use of cameras in iPhones wherever it is deployed. Who is it made for? Certainly not consumers! How many iPhone buyers ever said: “I wish they disabled my camera here”?

In a disingenuous propaganda move, Apple described this technology as being able to “…stop smartphone cameras being used at concerts”. Right. It’s about concerts. It can’t possibly be meant to be used by corporate or government security. And I’m sure no police department would think of wiring this device to a squad car’s flashers, to prevent bystanders from filming police work. It’s strictly about concerts. Nothing to see and record here, move along.

United’s poor “multi-factor authentication”

United Airlines (united.com) recently “upgraded” their Web site security. They sensibly discontinued 4-digit PIN logins and now require a password of at least 8 characters – standard practice these days. It would’ve been a reasonable change if they hadn’t left a loophole one can fly an airliner through.

As a complement to stronger passwords, united.com also required account holders to set up “secret questions”. Leaving aside the question of whether this is a good security measure in general, United’s implementation is recklessly poor. A user can’t enter their own answer – one must select from a small list of curated items. For the question “What color was the home you grew up in?”, there are 12 choices available. “What is your favorite cold weather activity?” gives you 23 options. Those are low numbers – but it gets worse! When trying to reset a password, a user will be presented with 2 questions – and only 10 choices to select from for each question!

[Screenshot: united.com password-reset question offering 10 fruit choices]

So you only need to guess 1 out of 10 twice – a 1-in-100 chance per attempt – and you are in.
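The arithmetic behind that claim, using the numbers from above, is easy to check — and contrast with even a weak free-form password:

```java
public class GuessOdds {
    public static void main(String[] args) {
        // 2 secret questions, each with only 10 curated answers to pick from
        int combinations = 10 * 10;
        System.out.println("Total combinations: " + combinations); // 100
        System.out.println("Chance per blind attempt: 1 in " + combinations);

        // For comparison: even an 8-character all-lowercase password
        long passwordSpace = (long) Math.pow(26, 8);
        System.out.println("8-char a-z password space: " + passwordSpace); // 208827064576
    }
}
```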

This is not extra security, this is security theater. But of course in air travel, security theater is the norm (great job, TSA!)

Are True Names in Earthsea digital encryption keys?

I’ve recently reread Earthsea, a classic fantasy novel by Ursula Le Guin. The novel is set in a magic world and tracks the progress of Ged, a mage with unprecedented powers, from his childhood to adulthood. Keeping in mind an often-cited quote by another classic writer, Arthur C. Clarke – “Any sufficiently advanced technology is indistinguishable from magic” – I decided to peel back the curtain of magic and imagine what practical inventions may stand behind the magical concepts of the Earthsea world. I’m most interested in the concept of true names. I will show how true names could relate to modern day digital encryption – with a mage’s true name being his personal digital certificate.

Some of the technology behind the magic in the book is trivially easy to see. For example, when Ged applied a spell to a brackish spring on a tiny sea island, it means he’s installed a filtration system. Water desalination was a plausible alternative, but, since it is more complex and bulky (using modern technology), it would be less likely.

There are many cases when figuring out the exact technical solution is impractical because it is described in magical terms. Now, any technology would look like magic to people who don’t understand it. They would simply have no concepts to explain it and wouldn’t know where to look for clues. So naturally people in medievalist societies where magic fantasies are usually set can’t provide adequate descriptions of advanced technology they mistake for magic. Hence the readers lack cues to accurately translate the magic to the modern day technical concepts. This is certainly by design.

With this, let’s get to true names, the most creative and crucially important device in this book. Many entities in the Earthsea universe have secret true names, which have special magic meaning. Here’s what we know about true names:

  • Humans, other live beings and geographical objects (seas and islands) have true names.
  • Each human has a unique true name.
  • A true name is permanent and can’t be changed.
  • Animals have one true name for the whole species.
  • Ged’s true name was assigned to him by a mage when he was near puberty. He didn’t have a true name before that. Many other people aged 14 and up also have true names. No human other than child Ged is specifically described as having no true name.
  • True names can be used by beings with magical powers (human mages and dragons) to control the behavior of the name bearer.

What we don’t know:

  • Do human-made objects (such as houses and boats) have true names? We know that spells are routinely applied to them, but it’s not clear if such spells require knowing the true name of the object.
  • Do all humans get their true names assigned by mages in their teen years? There is not enough information in Earthsea to draw conclusions (and I didn’t read the sequels to Earthsea).

Based on what we know, true names must mean different things for humans, animals and inanimate objects. This would explain the fact that human true names are individual, but animals of the same species share the same true name. We also know that there are several kinds of magic, taught by different Master Mages. It stands to reason that if true names unlock different kinds of magic, then the true names themselves may be of different types. Let’s start with animals, which seem to be the easiest case. For animals, the logical true name would be a DNA signature. The magic it enables is, then, biotech-based.

For humans, we could also go with personally targeted bio. One example of this is found in Vernor Vinge’s book Rainbows End. The book is mostly about near term advanced digital technology and augmented reality. At one point Alice Gu is incapacitated by a virus targeted to her individually. The creation of the virus required collection of her DNA sample.

There are other ways to individually target humans. The simplest one is an identifier issued by a competent authority, something like a Social Security Number in the US. And it doesn’t have to be a number – it could be just a name. After all, only a few years after Earthsea, Ursula Le Guin used this idea in her book The Dispossessed – individual names are unique and assigned on a planet-wide basis.

In this case, determining someone’s true name would be roughly equivalent to modern day identity theft. We know that identity theft enables the thief to take over the victim’s accounts, disrupting their ability to communicate and their access to finances. It can be really demoralizing, which matches the effect of an adversary knowing your true name (rendering you unable to defend yourself). For this to work, the target has to be quite advanced and sophisticated: medieval subsistence farmers won’t suffer much from identity theft, because they don’t bank or use Skype. From the book, it’s not clear whether revealing the true name is equally damaging to different people. We know that it has a debilitating effect on mages, but they are a unique caste living by special rules. We have to assume that they are the super-sophisticated hi-tech elite of Earthsea, so they could indeed suffer from identity theft.

Another possibility is a breach of anonymity. In one of the first Sci-Fi books of the digital era, True Names, Vernor Vinge shows how a cyberspace character is in grave danger if his real world identity is exposed. Thus, the true name is simply the real name in the physical world. In True Names, the danger comes from the government, which is able to apply its heavy hand to coerce desired behavior in cyberspace. But “we know where you live” has also been an effective threat from non-government actors, such as the mafia, the KKK and jihadists. In the Internet era this threat has grown in potency and acquired a new name: doxing.

Finally, here’s the tantalizing possibility: true name could be a person’s private encryption key. Such a key is necessary to create digital signatures proving authenticity of communications. And it must be kept secret, otherwise an attacker could impersonate the victim. Again, this will only work against hi-tech targets dependent on electronic gadgets and communication services. If you are a mage deploying an array of electronic devices, sensors, weapons and so on, a lost private key means that your adversary can now impersonate you in all digital communications. He can issue commands that will appear to be coming from you. He can control all of your devices and turn your weapons against you, which must spell quick and immediate defeat. The threat is made more severe by the fact that in the Earthsea, the key (true name) can’t be changed. We can see the impact of this oversight in the book, where once someone’s true name is known, that person (or dragon) remains in danger forever. An ability to change keys is a fundamental tenet of modern crypto systems: if keys are stolen, changing the keys (think passwords) allows a quick closing of the security breach. Changing keys is so important that some security experts advise against biometric security (e.g. fingerprints or retina scans) because the secrets are unchangeable. Once your fingerprints make it into the wrong hands, you are as helpless as a wizard of Earthsea whose true name has been revealed.
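The private-key reading can be sketched with the JDK’s built-in java.security API (the message text is of course made up; this is an illustration of the analogy, not anything from the book):

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

// Sketch: a "true name" as a private signing key.
public class TrueName {

    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Only the holder of the private key -- the true name -- can produce this.
    static byte[] sign(PrivateKey key, byte[] message) {
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(key);
            signer.update(message);
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Anyone with the public key can check that a message is authentic.
    static boolean verify(PublicKey key, byte[] message, byte[] sig) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(key);
            verifier.update(message);
            return verifier.verify(sig);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair pair = newKeyPair(); // the private half is the secret "true name"
        byte[] message = "a command issued in my name".getBytes(StandardCharsets.UTF_8);
        byte[] sig = sign(pair.getPrivate(), message);
        System.out.println("authentic: " + verify(pair.getPublic(), message, sig)); // true
        // Whoever steals the private key can impersonate its owner -- and in
        // Earthsea, unlike in real crypto systems, the key can never be rotated.
    }
}
```

Real systems survive a leaked key by revoking and reissuing it; Earthsea’s fatal design flaw is that the true name is a key with no rotation mechanism.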

If true names can be understood in terms of digital encryption, then a book on the current FBI vs Apple encryption fight was written half a century ago.

Locating JIT-generated machine code using gdb

This article describes how I was able to locate machine instructions generated by the OpenJDK JIT compiler on Linux using the gdb debugger. It may be useful to those who want to understand what JIT does with your code. I used this procedure to find that JIT emits efficient branchless code, as described here. I decided to share this information because I spent a considerable amount of time figuring it out. It’s possible that easier approaches exist – please feel free to suggest them.

First, I used these 2 options with the java command: -XX:+AggressiveOpts -XX:+PrintCompilation

AggressiveOpts forces the JVM to compile almost everything it sees into native instructions. PrintCompilation makes it report what it does compile. With these 2 options together, you have a great chance that your target function will be compiled, and you get a log statement to confirm that it was.

Next, I found that it is best to place a breakpoint just before your target section of code. But how does one set a breakpoint? Java breakpoints are useless for native code debugging, because Java debuggers don’t give you an opportunity to jump from Java code into native instructions. A breakpoint in the Java interpreter code in the JVM would be triggered countless times while the JVM interprets your Java program. I found this to be impractical even for tiny handcrafted sample programs. Answer: you create a simple NOOP native function and set a breakpoint in it!

I took inspiration from Fahd’s core dumper code. He describes how to invoke native code from Java to cause a core dump (by deliberately dereferencing a null pointer). We can take this example, but eliminate the core dump. Our function will do absolutely nothing – it will simply serve as a hook to hang the debugger breakpoint on.

The steps will be:

1. Create a Java class:


public class Hookable {
    // load the library
    static {
        System.loadLibrary("nativelib");
    }

    // native method declaration
    public native void here();

    public static void main(String[] args) {
        // any prep work
        new Hookable().here();
        // Your target code goes here
    }
}

2. Compile your class with javac


javac Hookable.java

3. Generate JNI header file


javah -jni Hookable

4. Write C code for the function on which you will set a breakpoint in gdb:


#include "Hookable.h"

void foo() {
    int a = 2;
    return;
}

JNIEXPORT void JNICALL Java_Hookable_here(JNIEnv *env, jobject obj) {
    foo();
}

5. Compile the C code


gcc -fPIC -o libnativelib.so -shared -I$JAVA_HOME/include/linux/ -I$JAVA_HOME/include/ Hookable.c

6. Launch gdb to run Java under its control:


gdb --args java -XX:+AggressiveOpts -XX:+PrintCompilation Hookable

When gdb starts, add a breakpoint for foo. gdb will complain, because Java is not yet running, so foo can’t be found yet. Confirm that you really want to add this breakpoint by responding “y” to the prompt:


(gdb) break foo
Function "foo" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (foo) pending.

Run Java:


(gdb) run
Starting program: /usr/bin/java -XX:+AggressiveOpts -XX:+PrintCompilation Hookable
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff7fce700 (LWP 7336)]
…

gdb will stop at the breakpoint in foo:

Breakpoint 1, 0x00007fffe48c96ff in foo ()
from /home/ivan2/dev/benchmarks/libnativelib.so
(gdb)

Now would be a good time to set up your debugging session. For example, tell gdb to display the next machine instruction to be executed:


(gdb) display/i $pc
1: x/i $pc
=> 0x7fffe48c96ff <foo+4>: movl $0x2,-0x4(%rbp)
(gdb)

Exit foo and Java_Hookable_here, the 2 frames that exist solely as a marker and a place to hang the breakpoint on:


(gdb) fin
Run till exit from #0 0x00007fffe48c96ff in foo ()
(gdb) fin
Run till exit from #0 0x00007fffe48c9723 in Java_Hookable_here ()

Your debugging session is now stopped at the point marked “// Your target code goes here”. The next debugger step will take you out of native C code and back into the Java realm, exactly where you want to be. Happy hunting for JIT output!

Confirmed: OpenJDK JIT compiler emits efficient conditional move machine codes

On the popular programmers’ Q&A site stackoverflow.com, the hit question of the last few years dealt with the effect of branching on program performance. I recommend that you read the brilliant explanation there, but if you don’t have much time, here’s the summary. Modern processors execute commands in a pipeline. While one command is finishing executing, the next one is already started, the one after that is being prepared, and so on. Pipelining can only work if the processor knows which command will execute next. If the program is linear, determining the next command is trivial. But if the program branches (think “if” statement), the processor can only guess what comes next. If the processor guesses wrong, it has to back out of the several steps it took down the wrong branch. Such a backout severely degrades performance.

Fortunately, modern processors have branch prediction mechanisms, which allow them to guess branching correctly when a particular branch exhibits consistent behavior. An example would be an if statement inside a loop. If the code takes the “then” branch consistently, the processor will be able to make the correct guess and performance will remain optimal.

So efficient code would avoid branching, or at least make its branching predictable. One could assume that to write efficient code, a programmer needs to be aware of branch prediction and code around it. An example in the linked Stack Overflow article shows how sorting data before processing it in a loop creates consistent branching and dramatically improves performance. Another trick would be to replace mini-branches with equivalent branchless code. For example, the function min(a,b) is normally written as

if (a<b) return a; else return b; 

It could be rewritten (for Java’s 32-bit signed ints, with an assumption that the input range does not cause overflow/underflow) as

return a-((b-a)>>31&1)*(a-b)

Should you write code in this obscure style to gain performance? I’m going to show that in Java this trick is unnecessary.
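As a quick sanity check (with the same overflow caveat as above), the bitwise expression really does agree with Math.min:

```java
public class BranchlessMin {
    // The branchless min from above; valid when b-a does not overflow an int.
    static int min(int a, int b) {
        return a - ((b - a) >> 31 & 1) * (a - b);
    }

    public static void main(String[] args) {
        int[][] cases = { {3, 7}, {7, 3}, {5, 5}, {-4, 2}, {-9, -1} };
        for (int[] c : cases) {
            int got = min(c[0], c[1]);
            if (got != Math.min(c[0], c[1])) {
                throw new AssertionError("mismatch for " + c[0] + "," + c[1]);
            }
            System.out.println("min(" + c[0] + ", " + c[1] + ") = " + got);
        }
    }
}
```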

The key to the problem lies in the fact that in some cases, an if statement doesn’t produce branching! X86 processors (Intel and AMD), which power most PCs and a great many servers, have supported conditional move instructions since the Pentium Pro, released about 20 years ago in 1995. A conditional move instruction is a way to achieve a branching effect without branching code. The meaning of this instruction is “move a value from one place to another IF the specified condition is met; do nothing otherwise”. This instruction can be used to determine the minimum of 2 values or to express a ternary operator (?:) in C or Java.

The Stack Overflow post cited benchmark results that suggest that C compilers are not consistent in emitting these efficient conditional move instructions. Java seemed to use conditional move, based on performance results. I ran a slightly different benchmark (computing minimum of 2 numbers using if-return and ternary operator), comparing run times between branch-friendly ordered input data and noisy random input data. There was no difference in execution times, suggesting that branching was not a factor.
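My benchmark was along these lines (a simplified sketch, not the exact code I ran; timings vary by machine and JVM, so the point is only the shape of the experiment — identical data and results, differing solely in branch predictability):

```java
import java.util.Arrays;
import java.util.Random;

// Same elements, same result, different branch predictability.
public class BranchBench {

    // The conditional inside this loop is what branch prediction
    // (or a conditional move) has to cope with.
    static long sumAbove(int[] data, int threshold) {
        long sum = 0;
        for (int v : data) {
            if (v > threshold) {
                sum += v;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] random = new Random(42).ints(1_000_000, 0, 256).toArray();
        int[] sorted = random.clone();
        Arrays.sort(sorted); // sorted input makes the branch highly predictable

        long t0 = System.nanoTime();
        long a = sumAbove(sorted, 128);
        long t1 = System.nanoTime();
        long b = sumAbove(random, 128);
        long t2 = System.nanoTime();

        System.out.println("sums equal: " + (a == b)); // true: same elements
        System.out.printf("sorted: %d ms, random: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

If the JIT emits a conditional move for the if, the two timings come out essentially equal; with a real branch, the random input would be measurably slower.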

I decided to check what instructions OpenJDK JIT compiler generates. For my tests I used Java 7 on my home Ubuntu Linux running on 64 bit AMD processor.


ivan2@bigbox:~$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

I wrote a primitive test case with a simple min function:


public int ifmin(int a, int b) {
    if (a<b) return a; else return b;
}

My goal was to confirm that this simple code will be compiled to native instructions as a conditional move (CMOVxx), not a conditional jump (Jxx). My primitive function compiled to this Java bytecode:


0: iload_0
1: iload_1
2: if_icmpge 7
5: iload_0
6: ireturn
7: iload_1
8: ireturn

I caused the JVM to compile the bytecode to native instructions and traced the execution with the gdb debugger. By the way, locating the compiled code took quite a bit of effort. I probably missed an obvious approach, but in any case, I explain how I pinpointed the JIT-generated machine code in a separate post. I verified that compilation happened in a log file:


Compilation events (4 events):
Event: 0.077 Thread 0x00007f03bc0a9000 1 ivan.Code::ifmin (9 bytes)

I was then able to trace the execution of the function and see the actual machine instructions. The relevant section of gdb session looked like this:


=> 0x7fffed060190: sub $0x18,%rsp
=> 0x7fffed060197: mov %rbp,0x10(%rsp)

The instructions above set up the stack frame in preparation for the comparison.


=> 0x7fffed06019c: cmp %ecx,%edx

The above instruction compared a with b


=> 0x7fffed06019e: mov %ecx,%eax

And the above moved b to result


=> 0x7fffed0601a0: cmovl %edx,%eax

Next: conditional move if a<b. The logic of this step is as follows: prior to this step, result=b (we just moved b to result). Now, if a<b, result will be overwritten with a. So if a<b, then result=a. Otherwise, result remains unchanged, meaning result=b. Just what we wanted!

Next, finalize and return – nothing terribly interesting


=> 0x7fffed0601a3: add $0x10,%rsp
=> 0x7fffed0601a7: pop %rbp
=> 0x7fffed0601a8: test %eax,0xaf96e52(%rip) # 0x7ffff7ff7000
=> 0x7fffed0601ae: retq

Similarly, I confirmed that the ternary operator is compiled into a conditional move instruction. This Java code:


int a = aarr[1];
int b = barr[2];
res[3] = a<b?a:b;

produced similar machine code sequence, including conditional move:


=> 0x7fffed05fc9c: cmp %ebp,%r11d

The above compares a with b


=> 0x7fffed05fc9f: cmovl %r11d,%ebp

And this line overwrites one with the other – conditionally!

Conclusion: for simple conditionals, modern JIT compilers emit efficient branchless code. No additional programming tricks are required. The clever interview trick of computing the minimum of 2 numbers through obscure bitwise code should remain confined to interview rooms – it does nothing to improve performance in real applications.