Discussion: [Sqlgrey-users] Congratulations & PostgreSQL issue
Jean-Michel Pouré - GOOZE
2015-08-11 20:26:05 UTC
Dear Lionel,

First, I would like to congratulate you for bringing SqlGrey to the
community. I just discovered SqlGrey and it solved my SPAM problems.

Together with SPF validation, DNSBL and other filtering techniques, I
now have nearly zero spam.

I am running Debian and I am a big fan of PostgreSQL. The automatic
population of the database and the SQL code using prepared queries were
a very nice surprise.

Previously, I tested other greylisting solutions, and none of them had
the simplicity and power of SqlGrey. I am surprised to see that some
SqlGrey databases have been running for years without incident.

There might be modifications needed to withstand "huge" loads under
PostgreSQL. For example, I am thinking of as many as 10,000 queries per
second. This does not happen very often, I agree.

First, I think it would be interesting to:

* Use the special inet type for IPs and subnets:
http://www.postgresql.org/docs/current/static/datatype-net-types.html

I believe that using the inet type should be compatible with current
code in a fresh database.

SELECT '216.34.181'::cidr => "216.34.181.0/24"
SELECT '216.34.181.6'::inet => "216.34.181.6"
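For example (just a sketch, the addresses are made up), subnet matching
could then use the inet/cidr operators directly:

-- true: the address is contained in the class C network
SELECT '216.34.181.6'::inet << '216.34.181.0/24'::cidr;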

* The IP4r type could also be used for managing ranges of servers and
spammers. IP4r reportedly reached 6,000 queries per second on complex projects:
https://www.siafoo.net/article/53

* Add indexes on the domain_awl and from_awl tables. The advantage of CIDR
is that it can be indexed very efficiently.

Indexes allow the PostgreSQL planner to build clean search plans, working
mostly on indexes and reducing disk access to what should be close to zero.

Just my 2 cents!
Kind regards,
Jean-Michel
Lionel Bouton
2015-08-11 22:28:30 UTC
Post by Jean-Michel Pouré - GOOZE
Dear Lionel,
First, I would like to congratulate you for bringing SqlGrey to the
community. I just discovered SqlGrey and it solved my SPAM problems.
Happy to be of service. Numerous others have helped too (see the CONTRIB
file).
Post by Jean-Michel Pouré - GOOZE
Together with SPF validation, DNSBL and other filtering techniques, I
now have nearly zero spam.
I am running Debian and I am a big fan of PostgreSQL. The automatic
population of the database and the SQL code using prepared queries were
a very nice surprise.
Previously, I tested other greylisting solutions, and none of them had
the simplicity and power of SqlGrey. I am surprised to see that some
SqlGrey databases have been running for years without incident.
This is a testament to the robustness of both the database systems and
SQLgrey.
Post by Jean-Michel Pouré - GOOZE
There might be modifications needed to withstand "huge" loads under
PostgreSQL. For example, I am thinking of as many as 10,000 queries per
second. This does not happen very often, I agree.
In theory you should be able to reach these levels with PostgreSQL
easily even with write-intensive situations like SPAM storms if you use
enterprise level SSD to store your database. I've not tested SQLgrey
itself at these levels though (you may need to distribute the workload
on multiple instances).
Post by Jean-Michel Pouré - GOOZE
* Use the special inet type for IPs and subnets:
http://www.postgresql.org/docs/current/static/datatype-net-types.html
I believe that using the inet type should be compatible with current
code in a fresh database.
SELECT '216.34.181'::cidr => "216.34.181.0/24"
SELECT '216.34.181.6'::inet => "216.34.181.6"
I considered this very early (and in fact I ran some code using these
datatypes the very first day I started developing SQLgrey). PostgreSQL
is by far my favourite RDBMS but I wanted SQLgrey to avoid being a
PostgreSQL-only project so I refrained from using proprietary datatypes.
The performance difference should be negligible as a string
representation doesn't have much storage overhead (probably <3x) and
allows us to use a very simple (and fast) string operation in Perl and a
plain equality test on the src column for matching class C networks. This
is how greymethod "smart" (the default) and "classc" work. Using a
cidr datatype could be faster but I'm not aware of a single SQLgrey
installation where this would make a difference.
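For illustration (a sketch with made-up values, not the exact query
SQLgrey issues), the class C match boils down to a plain string equality
on src:

-- src holds the truncated class C part as plain text, e.g. '216.34.181'
SELECT 1
  FROM from_awl
 WHERE src = '216.34.181'
   AND sender_domain = 'example.com'
   AND sender_name = 'someone';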

Switching to the cidr datatype could probably give a minor speedup in
PostgreSQL, but it would drop SQLgrey support for SQLite and MySQL/MariaDB
(or make it more complex) and would probably bring a minor slowdown in
the Perl part.
Post by Jean-Michel Pouré - GOOZE
* The IP4r type could also be used for managing ranges of servers and
https://www.siafoo.net/article/53
I'm not familiar with the IP4r module but I guess that my answer above
would apply to it too.
Post by Jean-Michel Pouré - GOOZE
* Add indexes on the domain_awl and from_awl tables.
They are already indexed: the primary key automatically adds a unique
index which should be used by all queries (at least it was designed to).
Post by Jean-Michel Pouré - GOOZE
The advantage of CIDR
is that it can be indexed very efficiently.
The string index is very efficient too; the only real drawback is the
storage. If you have a SQLgrey database where shaving half of the
storage requirement would be beneficial, the cidr datatype might be
interesting to look at, but I seriously doubt that such a database exists.

Best regards,

Lionel

------------------------------------------------------------------------------
Karl O. Pinc
2015-08-12 00:03:13 UTC
On Wed, 12 Aug 2015 00:28:30 +0200
Post by Lionel Bouton
Post by Jean-Michel Pouré - GOOZE
There might be modifications needed to withstand "huge" loads under
PostgreSQL. For example, I am thinking of as many as 10,000 queries per
second. This does not happen very often, I agree.
In theory you should be able to reach these levels with PostgreSQL
easily even with write-intensive situations like SPAM storms if you
use enterprise level SSD to store your database. I've not tested
SQLgrey itself at these levels though (you may need to distribute the
workload on multiple instances).
You could always throw hardware at it. The latest Postgres
(at least) has multi-master capability.



Karl <***@meme.com>
Free Software: "You don't pay back, you pay forward."
-- Robert A. Heinlein

------------------------------------------------------------------------------
Jean-Michel Pouré - GOOZE
2015-08-12 08:57:21 UTC
Dear Lionel,

Thanks for your answers. I agree with most of them, except on the indexing
issue. I am completely opposed to database clusters; they do not work
and never will.

PostgreSQL (and other databases) works like this:

* Each SQL query is parsed and analysed before execution. The SQL
optimizer rewrites the query and executes the best "guessed" query
plan. A query plan is the sequence of steps needed to execute the query.

* There are basically two kinds of steps in a query plan: index lookups
(mostly in memory) and disk lookups.

* Disk lookups cost more CPU cycles. If your table "from_awl" has
1,000,000 entries and you need to go through all records without an
index, it will take at least 1,000,000 CPU cycles on an SSD disk and
ten times more on a SATA disk. Conversely, indexes allow the database
to pick up exactly the needed records. So if there are, let's say, only
5 records in the result and you are using indexes, it will cost only
about 5 CPU cycles.

The number of CPU cycles is approximate, but the idea is there.

Even if a table has a unique index on OID, it will not speed up a
search on the table. You need an index on EACH column that is used in
queries where speed matters.

* If we take the example of the from_awl table, there are only two
indexes: the main OID one and last_seen. We need more:

CREATE INDEX from_awl_sendername
ON from_awl
USING btree
(sender_name);

CREATE INDEX from_awl_senderdomain
ON from_awl
USING btree
(sender_domain);

CREATE INDEX from_awl_src
ON from_awl
USING btree
(src);

CREATE INDEX from_awl_fseen
ON from_awl
USING btree
(first_seen);
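
Whether a given query actually uses one of these indexes can then be
checked with EXPLAIN, for example (the value is made up):

EXPLAIN
SELECT * FROM from_awl
 WHERE sender_domain = 'example.com';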

A nice PostgreSQL server with enough memory and SSD storage can handle
a huge number of queries. I think 100,000 queries per second should not
be a problem.

If you need to reconnect for each query, you can create a dedicated
user with read-only access to the tables and allow password-less
authentication in PostgreSQL.
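For example (just a sketch, the role name is made up):

-- dedicated read-only role without a password
CREATE ROLE sqlgrey_ro LOGIN;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO sqlgrey_ro;

-- and in pg_hba.conf, something like:
-- local   sqlgrey   sqlgrey_ro   trust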

There are also bottlenecks on the client side. I did not read the Perl
code, but all data searching should be done at the SQL level rather
than in Perl code.

To know the real performance of a database, it is possible to log slow
queries. In PostgreSQL this is done by setting, in postgresql.conf:
log_min_duration_statement = 30

The value is in milliseconds, so this logs every query slower than 30 ms.

It should be possible to create a small script that populates the
database with millions of records and see how it behaves under high
load. No SQL query should take more than 30 ms.
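
A sketch of such a script (assuming the from_awl layout shown later in
this thread: sender_name, sender_domain, src, first_seen, last_seen,
with made-up values):

INSERT INTO from_awl (sender_name, sender_domain, src, first_seen, last_seen)
SELECT 'user' || i,                               -- fake sender local part
       'example' || (i % 100000) || '.com',       -- fake sender domain
       (i % 256) || '.' || ((i / 256) % 256) || '.' || ((i / 65536) % 256), -- fake class C prefix
       now() - interval '1 day',
       now()
  FROM generate_series(1, 1000000) AS s(i);

Running EXPLAIN ANALYZE on typical SQLgrey lookups against such a data
set would then show whether they stay under the 30 ms threshold.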

Kind regards,
Jean-Michel


------------------------------------------------------------------------------
Jean-Michel Pouré - GOOZE
2015-08-12 10:10:33 UTC
Another issue about:
CREATE INDEX connect_idx
ON "connect"
USING btree
(src , sender_domain , sender_name);

This is an index on three columns. It is effective only if the SQL
query includes these three columns. It can be very fast, but it should
not be considered the same as THREE different indexes. I agree it is
not very well documented in the PostgreSQL documentation, but all databases
work the same way, and this index is only effective in a limited number of
queries.

Anyway, I don't have enough data to start a performance test. I will
write another message on the list to launch a small testing project.


Kind regards,
Jean-Michel
Lionel Bouton
2015-08-12 12:56:57 UTC
Post by Jean-Michel Pouré - GOOZE
CREATE INDEX connect_idx
ON "connect"
USING btree
(src , sender_domain , sender_name);
This is an index on three columns. It is effective only if the SQL
query includes these three columns.
No, that's not the case. The index is also used when only the first
column, or the first and second columns, appear in a query.
Post by Jean-Michel Pouré - GOOZE
It can be very fast, but it should
not be considered the same as THREE different indexes. I agree it is
not very well documented in the PostgreSQL documentation, but all databases
work the same way, and this index is only effective in a limited number of
queries.
And the order of the columns in this index was chosen so that all SQLgrey
queries can be sped up by it. Having three indexes instead of one would be
slower both when updating or inserting data in the database and when
selecting records based on conditions on multiple columns, as SQLgrey
does.
You only have to use EXPLAIN on queries done by SQLgrey on your database
to verify this.
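
For example (a sketch with made-up values), both of the following can use
the composite index because they filter on a leading prefix of its columns:

EXPLAIN SELECT * FROM "connect" WHERE src = '192.0.2';
EXPLAIN SELECT * FROM "connect"
 WHERE src = '192.0.2' AND sender_domain = 'example.com';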

Best regards,

Lionel.

------------------------------------------------------------------------------
Jean-Michel Pouré - GOOZE
2015-08-12 13:05:09 UTC
Post by Lionel Bouton
You only have to use EXPLAIN on queries done by SQLgrey on your database
to verify this.
Sorry, I replied before reading this.
I am convinced; your index is very interesting indeed.

Kind regards,
Jean-Michel

Lionel Bouton
2015-08-12 12:52:27 UTC
Post by Jean-Michel Pouré - GOOZE
[...]
main OID
No:

sqlgrey=> \d from_awl
                  Table "public.from_awl"
    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 sender_name   | character varying(64)       | not null
 sender_domain | character varying(255)      | not null
 src           | character varying(39)       | not null
 first_seen    | timestamp without time zone | not null
 last_seen     | timestamp without time zone | not null
Indexes:
    "from_awl_pkey" PRIMARY KEY, btree (src, sender_domain, sender_name)
    "from_awl_lseen" btree (last_seen)


There is no index on OID in the SQLgrey database.
Post by Jean-Michel Pouré - GOOZE
CREATE INDEX from_awl_sendername
ON from_awl
USING btree
(sender_name);
CREATE INDEX from_awl_senderdomain
ON from_awl
USING btree
(sender_domain);
CREATE INDEX from_awl_src
ON from_awl
USING btree
(src);
CREATE INDEX from_awl_fseen
ON from_awl
USING btree
(first_seen);
You don't need these indexes and switching from the from_awl_pkey to
these indexes would slow SQLgrey. Run EXPLAIN on any query done by
SQLgrey and you should see that it uses the from_awl_pkey index (if it's
not, your database is probably not in a sane state).

Best regards,

Lionel

------------------------------------------------------------------------------
Jean-Michel Pouré - GOOZE
2015-08-12 13:01:15 UTC
Post by Lionel Bouton
You don't need these indexes and switching from the from_awl_pkey to
these indexes would slow SQLgrey. Run EXPLAIN on any query done by
SQLgrey and you should see that it uses the from_awl_pkey index (if it's
not, your database is probably not in a sane state).
Okay, thanks. I am not sure that an index on several columns is useful as a
primary index.

"from_awl_pkey" PRIMARY KEY, btree (src, sender_domain, sender_name)

This is not the same as three indexes and can be very fast when a query is
run on all three columns. Otherwise, it should be slow.

I will try to create a fake database and will report later.

Kind regards,
Jean-Michel