Discussion:
[Sqlgrey-users] Add primary keys to tables?
Holger Steinhauer
2016-01-26 17:31:48 UTC
Permalink
Hi all.

I am working on a small private project which also involves sqlgrey integration. Lately, I was playing around with the data and just came to the point I was before: The tables do not have real primary keys. Did anyone try to run sqlgrey with an altered database schema, where the ID is generic and generated for each entry?
Also, is sqlgrey still in development? Just asking if it is “worth” raising an Feature Request :)

Any help / hints are highly appreciated.

Cheers,
Holger
Jean-Michel Pouré - GOOZE
2016-02-08 20:47:17 UTC
Permalink
Post by Holger Steinhauer
I am working on a small private project which also involves sqlgrey
integration. Lately, I was playing around with the data and just came
to the point I was before: The tables do not have real primary keys.
Did anyone try to run sqlgrey with an altered database schema, where
the ID is generic and generated for each entry? 
Also, is sqlgrey still in development? Just asking if it is “worth”
raising an Feature Request :)
Any help / hints are highly appreciated.
I posted the same message on Tue, 11 Aug 2015 22:26:05 +0200
Just my 2 cents, I guess this is to allow huge loads, maybe hundred thousands of queries every second.

Kind regards,
Jean-Michel
Karl O. Pinc
2016-02-08 22:09:58 UTC
Permalink
On Mon, 08 Feb 2016 21:47:17 +0100
Post by Holger Steinhauer
I am working on a small private project which also involves sqlgrey
integration. Lately, I was playing around with the data and just
came to the point I was before: The tables do not have real primary
keys. Did anyone try to run sqlgrey with an altered database
schema, where the ID is generic and generated for each entry? 
Also, is sqlgrey still in development? Just asking if it is “worth”
raising an Feature Request :)
You'd have to put forward a use-case. Otherwise it's wasted
storage.



Karl <***@meme.com>
Free Software: "You don't pay back, you pay forward."
-- Robert A. Heinlein
Lionel Bouton
2016-02-08 23:14:56 UTC
Permalink
Hi,

I didn't answer the original message at first because :
- I believed I reacted on a similar subject several years ago,
- the message title doesn't actually make sense to me.

I suspect some people are coming here with a basic ORM background and
don't fully understand the difference between primary keys and object
ids (PKs/OIDs in the following). So here is a complete answer..
Post by Karl O. Pinc
On Mon, 08 Feb 2016 21:47:17 +0100
Post by Holger Steinhauer
I am working on a small private project which also involves sqlgrey
integration. Lately, I was playing around with the data and just
came to the point I was before: The tables do not have real primary
keys.
Sure they have at least for tables where it makes sense. The notable
exception is "connect" because enforcing unicity with a PK would both
slow down access to this very time-sensitive table and force SQLgrey to
handle collisions which it really doesn't need to.
I suspect that what you call "real primary keys" are in fact OIDs (not
in the "internal database value" sense, but in the "ORM object id"
sense). From the top of my head, OIDs are only useful in the following
cases :
- identifying objects that don't have any obvious primary key (for
example an user on a web application could want to change his login and
not break all relations with the data linked to him/her),
- simplifying ORMs code and especially how they manage foreign keys
(some ORMs can handle composite PKs though),
- making indexes on foreign keys accesses faster.
We don't use any ORM or foreign keys in SQLgrey and all the objects we
manage are uniquely defined with invariable properties so we are free to
avoid OIDs as they would serve no purpose at all and just slow us down.
Post by Karl O. Pinc
Post by Holger Steinhauer
Did anyone try to run sqlgrey with an altered database
schema, where the ID is generic and generated for each entry?
I'm not sure why you would do such a thing.
Post by Karl O. Pinc
Post by Holger Steinhauer
Also, is sqlgrey still in development? Just asking if it is “worth”
raising an Feature Request :)
There hasn't been any development for a while but I'm not aware of any
bug surfacing for a long time either. AFAIK it just works.
I'm not likely to put much effort in it. 99% of my developments are in
Ruby now and while Perl isn't bad it's not as comfortable for me. So the
latest changes have been implemented by other developers.
I thought about rewriting it in Ruby some years ago. This would have
been an opportunity to make the code more readable (I think I could do a
better job on this front) but as I said, it just works (I still use it
myself on several personal and corporate servers).
Post by Karl O. Pinc
You'd have to put forward a use-case. Otherwise it's wasted storage.
Not only would it be wasted storage but it would slow down the database
too (dead weight slows you down, for example not being able to keep the
whole database in RAM because of the additional weight would have a
significant impact on large sites).
Jean-Michel Pouré seemed to imply that it could have performance
benefits bat I can't see how. There's not a single access to the
database that could even use an OID, every object must be accessed based
on conditions on its attributes (parts of email addresses, which are the
only information relevant to SQLgrey's database a mail server can
transmit to it).

Best regards,

Lionel
Lionel Bouton
2016-02-08 23:30:12 UTC
Permalink
Post by Lionel Bouton
Hi,
- I believed I reacted on a similar subject several years ago,
- the message title doesn't actually make sense to me.
I suspect some people are coming here with a basic ORM background and
don't fully understand the difference between primary keys and object
ids (PKs/OIDs in the following). So here is a complete answer..
If people want to dig more OIDs are in fact surrogate keys (by
opposition to natural keys which are used in SQLgrey). There's a ton of
documentation/discussion on the subject.

Best regards,

Lionel
incredibleh0lg
2016-02-09 13:38:24 UTC
Permalink
Hi.
Post by Lionel Bouton
Post by Lionel Bouton
Hi,
- I believed I reacted on a similar subject several years ago,
- the message title doesn't actually make sense to me.
I suspect some people are coming here with a basic ORM background and
don't fully understand the difference between primary keys and object
ids (PKs/OIDs in the following). So here is a complete answer..
If people want to dig more OIDs are in fact surrogate keys (by
opposition to natural keys which are used in SQLgrey). There's a ton of
documentation/discussion on the subject.
I understand that this can bring up discussions. But I would disagree with the term OIDs (and thanks for the kind words about my “basic" ORM background...). Basically one can choose between Natural Keys and Generated / Artificial Keys. In my experience natural keys are always not the best choice. In every real world commercial project you will end up in changing your keys, because the natural candidates that are so unchangeable, do change. So generic keys are - at least for me - always the safest way to go.
One can still use generic primary keys and define constraints that the combination of local part and domain have to be unique.

Regarding the speed part: It is a completely valid point to decide to not use real primary keys to improve the performance. But in a normal database design you can achieve this by having views or pre-aggregated tables - or use one of the NoSQL databases.

Another point made in this discussion was to give a use case: Third party applications using the data in the database. In my case, I’d like to provide a similar functionality as SQLgrey Webinterface provides. So present the user with relevant entries - e.g. in the connect table - and let him decide what to do with it. In the current format this means, that the request has to carry the WHOLE entry as parameter to get it moved or deleted. This is pretty wasteful and it is the reason why I asked if somebody tested a slightly adjusted schema where generated primary keys are added by the database.In theory, this shouldn’t break the application.

Last but not least, the argument with “wasted space” is insignificant today. The only concern I have about this is that this can be an issue for servers with huge traffic, as the keys can get pretty big quite quickly. So having this as an optional feature would be the best.

As there is no real further development, it is maybe a good exercise to reimplement it with support for NoSQL databases (which, btw, use generic keys too).
But thanks for the feedback anyway - sqlgrey is a really reliable tool which is doing great work for years now. So please do not see this is critic, it was more a feature request / question about experiences :)

Cheers,
Holger
Lionel Bouton
2016-02-09 17:28:35 UTC
Permalink
Hi,
Post by incredibleh0lg
[...]
Another point made in this discussion was to give a use case: Third party applications using the data in the database. In my case, I’d like to provide a similar functionality as SQLgrey Webinterface provides. So present the user with relevant entries - e.g. in the connect table - and let him decide what to do with it.
I don't think it's a good example because you definitely don't want to
modify the connect table content in the back of SQLgrey : this is the
table it uses to implement the basic greylisting algorithm. So if you
want to modify the connect table you actually want to modify the
greylisting algorithm. It seems to me that the two sane approaches if
that's what you really want would be to :
- implement your changes in SQLgrey itself, not in a separate application,
- replace SQLgrey completely or fork it to implement your own algorithm
if what you want to do is very different or solves a specific problem
which isn't commonly encountered by SQLgrey uers.

The same limitations apply to the config/from_awl/domain_awl tables (awl
means *automatic* white list), if you try to modify them behind
SQLgrey's back you are actually designing a new greylisting program.

The only tables I created for external configuration are the
opt[in|out]_[email|domain] tables.
Post by incredibleh0lg
In the current format this means, that the request has to carry the WHOLE entry as parameter to get it moved or deleted.
In fact the connect table is a special case and doesn't even guarantee
that each entry is unique so you can't event select a single entry
deterministically.
The tables that are supposed to be managed externally only have a single
column : so the "WHOLE" entry is actually a single value which is
obviously known by any application which would be designed to insert or
delete entries (update doesn't make sense for these tables)... See below
if "obviously" is not a given for you.
Post by incredibleh0lg
This is pretty wasteful and it is the reason why I asked if somebody tested a slightly adjusted schema where generated primary keys are added by the database.In theory, this shouldn’t break the application.
It wouldn't break SQLgrey but it would make developing the application
managing SQLgrey's configuration tables more complex and bug-prone (see
a detailed explanation below).

I think you are probably approaching the SQLgrey database and an
external application's database as a single entity (see why I think so
below too), which I don't think is a good way to approach the external
configuration of SQLgrey.

In a single database for any moderately complex application surrogate
keys are nearly always the best choice for various reasons. For a
database dedicated to a trivial service like SQLgrey this is clearly not
the case.

Let's use a practical example to demonstrate.
Take the optout_email table (the other configuration tables are similar:
all have a single column) and create an application managing its content
(creating entries when users want to optout and deleting them when they
want to activate greylisting again).
If you use surrogate keys in the optout_email table the first question
you have to ask yourself is : does my application separate database's
user objects store :
- user emails,
- references to the surrogate keys in optout_email,
- both
and what is used among them to reference the optout_email entries.

You made clear that don't want to use the user emails to reference
optout_email entries so the only 2 choices for you remaining are the
last ones.
But actually if you store the references to surrogate keys you will most
probably want to keep a copy of the email address too. Not only because
it would be convenient in your application which would probably access
the Postfix user database (or a database used to create the Postfix one
which obviously must have the email addresses itself) but also because
optout_email won't have the email if the user didn't choose to opt out.
If you want to avoid storing the address at all costs you would make the
application and Postfix depend on a SQLgrey change to store all your
users' email addresses to solve this. This is not SQLgrey's role: it's a
greylisting service, not an email user database. If you want to create
your user database you obviously won't try to do so by installing a
greylisting service.

In conclusion there's no real choice once you have chosen to use
surrogate keys : you will have to store both the email and the reference
to the surrogate key in the external database in one way or another.

So now if you really want to use the surrogate keys instead of the email
itself you have to maintain consistency between your reference and the
email addresses stored in two separate databases with 2 different
codebases managed by 2 different sets of people. This is clearly (I hope
so anyway) not something you want to do *at all*. By avoiding surrogate
keys SQLgrey actually forces a potential separate database to be
inherently consistent with its own database.

In case this isn't crystal clear, as an example of the consistency
problems consider these common scenarios :
- admins create entries manually until enough users ask for direct
access and install the configuration application,
- admins use batches to create initial entries or maintain a subset of
the optout_email table in parallel to the configuration application,
these batches probably aren't aware of the existence of the
configuration application and don't know how to update its database,
- after a crash, on restoring the 2 separate databases (SQLgrey and
configuration application) you can get a more recent SQLgrey database
with entries not (yet) referenced by the other restored database version,
- ...
All these scenario imply that there will be entries in optout_email not
yet referenced by the configuration application.
When creating new users with addresses already in optout_email, how will
you create the references to these entries ? The only way is to check
that for each user you create in your application there isn't already an
entry in the optout_email table with the same email address. So for new
users your reference to the optout_email table is the email address and
for existing users the reference is the surrogate key : it's bad design
and a loud and clear invitation to bugs.

Now let's see how you would design the application without surrogate keys.
The application has its user objects with their own email attribute.
- when you want to activate optout, you use your application's email
attribute to find or create an entry in the optout_email table if needed,
- when you want to deactivate it, you find and delete the entry with the
same email,
- when you change the email address of a user, you remove the old entry
(if it exists) and create a new one if applicable.
This is *trivial* logic. An environment where this is difficult to do
would have to prevent you to make the most basic SQL queries imaginable.

Why should we even consider a change only useful to people starting to
develop external tools by shooting themselves in the foot when choosing
their development environment ?

Best regards,

Lionel
incredibleh0lg
2016-02-09 22:10:52 UTC
Permalink
Hi.

Thanks @Lionel for the long explanation. I really like to discuss this :)

So let’s go into it.
Post by Lionel Bouton
Hi,
Post by incredibleh0lg
[...]
Another point made in this discussion was to give a use case: Third party applications using the data in the database. In my case, I’d like to provide a similar functionality as SQLgrey Webinterface provides. So present the user with relevant entries - e.g. in the connect table - and let him decide what to do with it.
I don't think it's a good example because you definitely don't want to
modify the connect table content in the back of SQLgrey : this is the
table it uses to implement the basic greylisting algorithm. So if you
want to modify the connect table you actually want to modify the
greylisting algorithm. It seems to me that the two sane approaches if
- implement your changes in SQLgrey itself, not in a separate application,
- replace SQLgrey completely or fork it to implement your own algorithm
if what you want to do is very different or solves a specific problem
which isn't commonly encountered by SQLgrey uers.
The same limitations apply to the config/from_awl/domain_awl tables (awl
means *automatic* white list), if you try to modify them behind
SQLgrey's back you are actually designing a new greylisting program.
The only tables I created for external configuration are the
opt[in|out]_[email|domain] tables.
Ok, my bad (or more: my English skills :)) I see and treat SQLgrey’s database
as what it is - a database used by an external tool. The “relevant entries” is
something I still try to get my head around, as I will have to implement some
logic *outside* SQLgrey (+ it’s database) to filter entries. This is definitely not
SQLgrey’s job. It is and will always be the greylisting solution of my choice.
My “problem”, I try to solve, is to provide some kind of integrated UI where
users can manage their own Greylisting settings (opting out, whitelisting one
specific email etc.) SQLgrey should do what it is really good at: greylisting.
Post by Lionel Bouton
Post by incredibleh0lg
In the current format this means, that the request has to carry the WHOLE entry as parameter to get it moved or deleted.
In fact the connect table is a special case and doesn't even guarantee
that each entry is unique so you can't event select a single entry
deterministically.
The tables that are supposed to be managed externally only have a single
column : so the "WHOLE" entry is actually a single value which is
obviously known by any application which would be designed to insert or
delete entries (update doesn't make sense for these tables)... See below
if "obviously" is not a given for you.
Fair enough. I was just thinking about an option to get an optional surrogate
key (to stick with your terminology - every day a new word / term :D) to make
it easier to address an entry. As far as I understand the logic for whitelisting an
email currently “sitting in” greylisting, we have to create a corresponding entry
with the data from the ‘connect' table within the ‘from_awl’ table and delete the
entry in ‘connect’. At the moment I would have to submit 4 different values. As
the entries in ‘connect’ are not necessarily unique, a surrogate key would make
this easier. But life isn’t always easy, is it :)
Post by Lionel Bouton
Post by incredibleh0lg
This is pretty wasteful and it is the reason why I asked if somebody tested a slightly adjusted schema where generated primary keys are added by the database.In theory, this shouldn’t break the application.
It wouldn't break SQLgrey but it would make developing the application
managing SQLgrey's configuration tables more complex and bug-prone (see
a detailed explanation below).
I think you are probably approaching the SQLgrey database and an
external application's database as a single entity (see why I think so
below too), which I don't think is a good way to approach the external
configuration of SQLgrey.
See above, I wasn’t really specific enough. So sorry again :)
Post by Lionel Bouton
In a single database for any moderately complex application surrogate
keys are nearly always the best choice for various reasons. For a
database dedicated to a trivial service like SQLgrey this is clearly not
the case.
Agreed. Just the point I tried to make in my previous email.
Post by Lionel Bouton
Let's use a practical example to demonstrate.
all have a single column) and create an application managing its content
(creating entries when users want to optout and deleting them when they
want to activate greylisting again).
If you use surrogate keys in the optout_email table the first question
you have to ask yourself is : does my application separate database's
- user emails,
- references to the surrogate keys in optout_email,
- both
and what is used among them to reference the optout_email entries.
You made clear that don't want to use the user emails to reference
optout_email entries so the only 2 choices for you remaining are the
last ones.
But actually if you store the references to surrogate keys you will most
probably want to keep a copy of the email address too. Not only because
it would be convenient in your application which would probably access
the Postfix user database (or a database used to create the Postfix one
which obviously must have the email addresses itself) but also because
optout_email won't have the email if the user didn't choose to opt out.
If you want to avoid storing the address at all costs you would make the
application and Postfix depend on a SQLgrey change to store all your
users' email addresses to solve this. This is not SQLgrey's role: it's a
greylisting service, not an email user database. If you want to create
your user database you obviously won't try to do so by installing a
greylisting service.
Well, I wouldn’t mind to use the email address - although this can be tricky
in itself, depending on how the data gets transfered (http-encoded email
addresses can be a pain, esp. with + addresses, which I personally use
extensively).
I absolutely get your point and agree 100%. My application should not
“hijack” SQLgrey’s schema (and was never planned to do it). Otherwise it
would be worth implementing my own greylisting service. Which would be
an interesting challenge in itself, but not very high on my long ToDo list…
Post by Lionel Bouton
In conclusion there's no real choice once you have chosen to use
surrogate keys : you will have to store both the email and the reference
to the surrogate key in the external database in one way or another.
So now if you really want to use the surrogate keys instead of the email
itself you have to maintain consistency between your reference and the
email addresses stored in two separate databases with 2 different
codebases managed by 2 different sets of people. This is clearly (I hope
so anyway) not something you want to do *at all*. By avoiding surrogate
keys SQLgrey actually forces a potential separate database to be
inherently consistent with its own database.
In case this isn't crystal clear, as an example of the consistency
- admins create entries manually until enough users ask for direct
access and install the configuration application,
- admins use batches to create initial entries or maintain a subset of
the optout_email table in parallel to the configuration application,
these batches probably aren't aware of the existence of the
configuration application and don't know how to update its database,
- after a crash, on restoring the 2 separate databases (SQLgrey and
configuration application) you can get a more recent SQLgrey database
with entries not (yet) referenced by the other restored database version,
- ...
All these scenario imply that there will be entries in optout_email not
yet referenced by the configuration application.
When creating new users with addresses already in optout_email, how will
you create the references to these entries ? The only way is to check
that for each user you create in your application there isn't already an
entry in the optout_email table with the same email address. So for new
users your reference to the optout_email table is the email address and
for existing users the reference is the surrogate key : it's bad design
and a loud and clear invitation to bugs.
Absolutely agreed. And on this special occasion, I think it is absolutely fine to
not normalise the data. It is simply not necessary or helpful. The complete
opposite is the case: this would definitely have an performance impact.
Post by Lionel Bouton
Now let's see how you would design the application without surrogate keys.
The application has its user objects with their own email attribute.
- when you want to activate optout, you use your application's email
attribute to find or create an entry in the optout_email table if needed,
- when you want to deactivate it, you find and delete the entry with the
same email,
- when you change the email address of a user, you remove the old entry
(if it exists) and create a new one if applicable.
This is *trivial* logic. An environment where this is difficult to do
would have to prevent you to make the most basic SQL queries imaginable.
Why should we even consider a change only useful to people starting to
develop external tools by shooting themselves in the foot when choosing
their development environment ?
Couldn’t agree more.

Well, I think it was worth asking anyway. Thanks again for the very
detailed answer.

All the best,
Holger

Jean-Michel Pouré - GOOZE
2016-02-09 18:41:00 UTC
Permalink
Jean-Michel Pouré seemed to imply that it could have performance
benefits bat I can't see how. There's not a single access to the
database that could even use an OID, every object must be accessed based
on conditions on its attributes (parts of email addresses, which are the
only information relevant to SQLgrey's database a mail server can
transmit to it).
No I don't meant that. As it is the database structure is good.
Loading...