Discussion:
[Sqlgrey-users] Illegal mix of collations.
Philippe Chaintreuil
2010-01-20 13:34:39 UTC
Permalink
Hey all.

I think my MySQL DB isn't set up 100% the way SQLGrey would like.
Anytime I get a non-ASCII character in an e-mail address, MySQL pukes
all over SQLGrey's shoes. SQLGrey is forced to disconnect and reconnect
and the message is allowed through leaving a chink in my anti-spam armor.

I see lines like this in my logs:
--------------------------------------------------------------------
Jan 20 02:10:01 hobbes Illegal: mix of collations
(latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for
operation '=', reconnecting to DB

Jan 20 02:10:01 hobbes sqlgrey: warning: Use of uninitialized value in
concatenation (.) or string at /usr/sbin/sqlgrey line 1154.

Jan 20 02:10:01 hobbes sqlgrey: dbaccess: error: couldn't access
from_awl table:

Jan 20 02:10:01 hobbes sqlgrey: dbaccess: warning: couldn't do query:

Jan 20 02:10:01 hobbes UPDATE: from_awl SET last_seen = NOW(),
first_seen = first_seen WHERE sender_name = 'aloïsheinzfoiqu' AND
sender_domain = 'gmail.com' AND src = '222.216.29.53':

Jan 20 02:10:01 hobbes Illegal: mix of collations
(latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for
operation '=', reconnecting to DB
--------------------------------------------------------------------

Now, I just know enough about MySQL to be able to shoot myself in the
foot.... And I'm a dumb American who doesn't usually need to step
outside the ASCII garden and deal with other characters, so it's all
kind of foreign to me.... [Pun intended.]

Can anyone help me figure out what commands I need to issue on my
database to make this error go away?

Many thanks in advance.

-- Philippe Chaintreuil
Kenneth Marshall
2010-01-20 14:21:53 UTC
Permalink
Hi Philippe,

You will need to use the collation for MySQL that accepts any
8-bit character. For our PostgreSQL database it was SQL_ASCII.
As you found out, SQLGrey is not collation or UTF-8 safe.

Cheers,
Ken
Post by Philippe Chaintreuil
Hey all.
I think my MySQL DB isn't set up 100% the way SQLGrey would like.
Anytime I get a non-ASCII character in an e-mail address, MySQL pukes
all over SQLGrey's shoes. SQLGrey is forced to disconnect and reconnect
and the message is allowed through leaving a chink in my anti-spam armor.
--------------------------------------------------------------------
Jan 20 02:10:01 hobbes Illegal: mix of collations
(latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for
operation '=', reconnecting to DB
Jan 20 02:10:01 hobbes sqlgrey: warning: Use of uninitialized value in
concatenation (.) or string at /usr/sbin/sqlgrey line 1154.
Jan 20 02:10:01 hobbes sqlgrey: dbaccess: error: couldn't access
Jan 20 02:10:01 hobbes UPDATE: from_awl SET last_seen = NOW(),
first_seen = first_seen WHERE sender_name = 'alo?sheinzfoiqu' AND
Jan 20 02:10:01 hobbes Illegal: mix of collations
(latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for
operation '=', reconnecting to DB
--------------------------------------------------------------------
Now, I just know enough about MySQL to be able to shoot myself in the
foot.... And I'm a dumb American who doesn't usually need to step
outside the ASCII garden and deal with other characters, so it's all
kind of foreign to me.... [Pun intended.]
Can anyone help me figure out what commands I need to issue on my
database to make this error go away?
Many thanks in advance.
-- Philippe Chaintreuil
------------------------------------------------------------------------------
Throughout its 18-year history, RSA Conference consistently attracts the
world's best and brightest in the field, creating opportunities for Conference
attendees to learn about information security's most important issues through
interactions with peers, luminaries and emerging and established companies.
http://p.sf.net/sfu/rsaconf-dev2dev
_______________________________________________
Sqlgrey-users mailing list
https://lists.sourceforge.net/lists/listinfo/sqlgrey-users
Philippe Chaintreuil
2010-01-20 14:40:59 UTC
Permalink
Post by Kenneth Marshall
You will need to use the collation for MySQL that accepts any
8-bit character. For our PostgreSQL database it was SQL_ASCII.
As you found out, SQLGrey is not collation or UTF-8 safe.
Thanks for the quick response Ken.

I'm appreciative, but that's kind of gobblety-gook to me still....
It's like you told me that the flux-capacitor on my DeLorean is set to
the wrong quantum state; it might help Doc Brown, but until you tell me
that it means I have to plug the red plug into the brown slot, it's not
going to get me back to the future.

I don't know if you're the guy to ask, since you mention you're using
PostgreSQL, but I guess I'm looking for commands to run; direct actions
to take.

How can I turn off collation on MySQL? And get my (preferrably
existing) database switched to something that won't have problems with
umlauts-over-Os or any other crazy characters that might end up in an
e-mail address? And how can I verify it's correct? (I don't get these
e-mails very often: less than once a month.)

-- Philippe Chaintreuil
Bill McGonigle
2010-01-20 18:13:13 UTC
Permalink
Post by Philippe Chaintreuil
I don't know if you're the guy to ask, since you mention you're using
PostgreSQL, but I guess I'm looking for commands to run; direct actions
to take.
I'm also using PostgreSQL, but the HOWTO says to:

## MySQL
...
# Howto:
Launch the 'mysql' command-line utility with admin rights.
Create a sqlgrey database:
- For MySQL < 5.0
Post by Philippe Chaintreuil
CREATE DATABASE sqlgrey;
- For MYSQL >= 5.0
Post by Philippe Chaintreuil
CREATE DATABASE sqlgrey CHARACTER SET latin1;
Did you create your database with the correct character set? If so, the
docs may need an update. If not, perhaps you can dump/drop/create/load?

Here's a page about MySQL character sets:
http://dev.mysql.com/doc/refman/5.1/en/charset.html

-Bill
--
Bill McGonigle, Owner
BFC Computing, LLC
http://bfccomputing.com/
Telephone: +1.603.448.4440
Email, IM, VOIP: ***@bfccomputing.com
VCard: http://bfccomputing.com/vcard/bill.vcf
Social networks: bill_mcgonigle/bill.mcgonigle
Philippe Chaintreuil
2010-01-20 18:55:55 UTC
Permalink
Post by Bill McGonigle
## MySQL
...
Launch the 'mysql' command-line utility with admin rights.
- For MySQL < 5.0
Post by Philippe Chaintreuil
CREATE DATABASE sqlgrey;
- For MYSQL >= 5.0
Post by Philippe Chaintreuil
CREATE DATABASE sqlgrey CHARACTER SET latin1;
Did you create your database with the correct character set? If so, the
docs may need an update. If not, perhaps you can dump/drop/create/load?
I saw that after my first post. It may not have been there when I
first installed SQLGrey (years ago).

So someone directed me towards some simliar links to the one you posted:

http://dev.mysql.com/doc/refman/5.1/en/charset-charsets.html
http://kb.siteground.com/article/How_to_change_the_collation_for_all_tables_in_db_to_UTF8.html

So I tried to do this:

----------------------------------------------------------------
$ mysql -u root -p

mysql> ^Guse sqlgrey;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed

mysql> ALTER DATABASE sqlgrey CHARACTER SET 'latin1' COLLATE
'latin1_swedish_ci';
Query OK, 1 row affected (0.03 sec)

mysql> SHOW TABLES;
+-------------------+
| Tables_in_sqlgrey |
+-------------------+
| config |
| connect |
| domain_awl |
| from_awl |
| optin_domain |
| optin_email |
| optout_domain |
| optout_email |
+-------------------+
8 rows in set (0.00 sec)

mysql> ALTER TABLE config CHARACTER SET 'latin1' COLLATE
'latin1_swedish_ci';
Query OK, 2 rows affected (0.06 sec)
Records: 2 Duplicates: 0 Warnings: 0

mysql> ALTER TABLE connect CHARACTER SET 'latin1' COLLATE
'latin1_swedish_ci';
Query OK, 299 rows affected (0.09 sec)
Records: 299 Duplicates: 0 Warnings: 0

mysql> ALTER TABLE domain_awl CHARACTER SET 'latin1' COLLATE
'latin1_swedish_ci';
Query OK, 1348 rows affected (0.59 sec)
Records: 1348 Duplicates: 0 Warnings: 0

[... and so on for the other tables ...]
----------------------------------------------------------------

Then I restarted SQLGrey to force it to reconnect. Telnet'ed to port
25 from another machine and sent a fake e-mail by hand using that
"aloïsheinzfoiqu" as a username from the telnetting box's domain
(replaced here as @work.com and with it's IP obfuscated). It still failed:

----------------------------------------------------------------
Jan 20 12:34:18 hobbes sqlgrey: warning: Use of uninitialized value in
concatenation (.) or string at /usr/sbin/sqlgrey line 1154.

Jan 20 12:34:18 hobbes sqlgrey: dbaccess: error: couldn't access
from_awl table:

Jan 20 12:34:18 hobbes sqlgrey: grey: from awl match: updating
208.8.8.8(208.8.8.8), aloï***@work.com(aloï***@work.com)

Jan 20 12:34:18 hobbes sqlgrey: dbaccess: warning: couldn't do query:

Jan 20 12:34:18 hobbes UPDATE: from_awl SET last_seen = NOW(),
first_seen = first_seen WHERE sender_name = 'aloïsheinzfoiqu' AND
sender_domain = 'work.com' AND src = '208.8.8.8':

Jan 20 12:34:18 hobbes Illegal: mix of collations
(latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for
operation '=', reconnecting to DB
----------------------------------------------------------------

Any thoughts? Is it the connection that SQLGrey is making that's UTF-8
typed?

-- Philippe Chaintreuil
Lionel Bouton
2010-01-20 19:25:01 UTC
Permalink
Post by Philippe Chaintreuil
[...]
Any thoughts? Is it the connection that SQLGrey is making that's UTF-8
typed?
Maybe, but SQLgrey doesn't try to force anything, it uses the default
settings.
I'll look in the my.cnf file for any charset/collation set for the
client library.

Lionel
Philippe Chaintreuil
2010-01-20 19:28:57 UTC
Permalink
Sorry, I didn't wait for a response to my last "ALTER DATABASE/TABLE"
post. But I think I did something to fix it, but I don't know if it's a
proper fix or not.

So there's this page about MySQL character sets and connections, which
I skimmed:

http://dev.mysql.com/doc/refman/5.0/en/charset-connection.html

A standard connection to my sqlgrey database offers this:

-------------------------------------------------------------------
mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'coll%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | utf8_general_ci |
| collation_database | latin1_swedish_ci |
| collation_server | utf8_general_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
-------------------------------------------------------------------

So it looks like the database is set correctly, but the connection
stuff is not.... If I run:

-------------------------------------------------------------------
mysql> SET NAMES 'latin1';
Query OK, 0 rows affected (0.00 sec)
-------------------------------------------------------------------

Then the same queries look much better:

-------------------------------------------------------------------
mysql> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name | Value |
+--------------------------+----------------------------+
| character_set_client | latin1 |
| character_set_connection | latin1 |
| character_set_database | latin1 |
| character_set_filesystem | binary |
| character_set_results | latin1 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'coll%';
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database | latin1_swedish_ci |
| collation_server | utf8_general_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
-------------------------------------------------------------------

So I tried hacking sqlgrey (I don't know Perl, so this is *really*
dangerous)....

To the connectdb() function, near the bottom there's a MySQL check and
a line of code that turns on DBD's auto-reconnect code, I stuck an extra
line on the end of that to run the SET NAMES command:

-------------------------------------------------------------------
# mysql drops the connection, we have some glue code
# to reinit the connection, but better use mysql DBD code
if ($self->MySQL()) {
$self->{sqlgrey}{dbh}->{mysql_auto_reconnect} = 1;
# Force latin1 character set.
$self->{sqlgrey}{dbh}->do("SET NAMES 'latin1'");
}
-------------------------------------------------------------------

I then did another test e-mail and SQLGrey did not freak out about it;
it greylisted the connection just like it should have.

But I don't know if this is where this should be, or even if it should
be or not.

- Does SQLGrey really need the "latin1" settings?
- Would everything have worked if everything was set to UTF-8?
- Do other databases need this (or something similar)?
- What should happen if that line I added failed?
- Continue anyway with just a warning?
- Should there be a check to make sure the connection and the database
type match?

-- Philippe Chaintreuil
Lionel Bouton
2010-01-20 19:44:35 UTC
Permalink
Post by Philippe Chaintreuil
[...]
But I don't know if this is where this should be, or even if it should
be or not.
- Does SQLGrey really need the "latin1" settings?
Anything that ASCII can map too is fair game.
Post by Philippe Chaintreuil
- Would everything have worked if everything was set to UTF-8?
No : email adresses come in ASCII. Some ASCII code aren't supported by
UTF-8.
I suppose these addresses are invalid.
Post by Philippe Chaintreuil
- Do other databases need this (or something similar)?
Philippe Chaintreuil
2010-01-20 20:54:34 UTC
Permalink
Post by Lionel Bouton
I suppose these addresses are invalid.
Then I've fixed the wrong thing, haven't I?


Should SQLGrey be sterilizing invalid characters to something safe?


Wikipedia says the allowed characters are even smaller than ASCII:

The local-part of the e-mail address may use any of these ASCII characters:

----------------------------------------------------------------------
* Uppercase and lowercase English letters (a-z, A-Z)
* Digits 0 to 9
* Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
* Character . (dot, period, full stop) provided that it is not the
first or last character, and provided also that it does not appear two
or more times consecutively.
----------------------------------------------------------------------
http://en.wikipedia.org/wiki/E-mail_address#RFC_specification


Perhaps all characters that aren't one of these should be replaced by a
space-character (' ') or a colon? UTF-8 of this more limited character
set + space + colon would still be a subset of UTF-8, wouldn't it? (Of
course, that thought process is just one based off my problem. I don't
know if that line in the SQLGrey documentation that mentions 'latin1'
was for me as an ASCII person, or for someone whose default character
set is some Chinese character set that isn't a ASCII super-set.)

You might get some collisions between invalid addresses.

I also don't know how that whole International Domain Name thing I hear
about every once in a while plays into it, or how RFC 5335
<http://tools.ietf.org/html/rfc5335> "Internationalized Email Headers"
[Experimental] would.


-- Philippe Chaintreuil
Bruce Bodger
2010-01-20 21:05:54 UTC
Permalink
Post by Philippe Chaintreuil
Should SQLGrey be sterilizing invalid characters to something safe?
If a message passes through PostFix, it should pass through SQLGrey.
"SQLgrey is a postfix policy service..."

B. Bodger
Oklahoma City, OK
Lionel Bouton
2010-01-20 21:10:24 UTC
Permalink
Post by Philippe Chaintreuil
Post by Lionel Bouton
I suppose these addresses are invalid.
Then I've fixed the wrong thing, haven't I?
Should SQLGrey be sterilizing invalid characters to something safe?
That's not its purpose. In the end if the address is invalid, the
mailbox doesn't exist and there's no SPAM problem :-)
Avoiding modifications on the email addresses will make SQLgrey more
future-proof. The more it can treat strings like binary arrays, the less
there will be problems with new standards.

BTW if you want to gain a little speed, you may want to configure
Postfix to reject user names with invalid characters (probably a regex
on the
Philippe Chaintreuil
2010-01-20 21:31:33 UTC
Permalink
Post by Lionel Bouton
That's not its purpose. In the end if the address is invalid, the
mailbox doesn't exist and there's no SPAM problem :-)
These are FROM addresses though: they may exist.
Post by Lionel Bouton
Avoiding modifications on the email addresses will make SQLgrey more
future-proof. The more it can treat strings like binary arrays, the less
there will be problems with new standards.
IDN is using an encoding which can be stored in most charsets, so no
problem here.
So then SQLGrey should be defining it's required character set for the
connection through some mechanism. (SET NAMES 'latin1')
Post by Lionel Bouton
BTW if you want to gain a little speed, you may want to configure
Postfix to reject user names with invalid characters (probably a regex
on the
Kyle Lanclos
2010-01-20 21:26:34 UTC
Permalink
Post by Philippe Chaintreuil
----------------------------------------------------------------------
* Uppercase and lowercase English letters (a-z, A-Z)
* Digits 0 to 9
* Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
* Character . (dot, period, full stop) provided that it is not the
first or last character, and provided also that it does not appear two
or more times consecutively.
----------------------------------------------------------------------
Strict adherence to RFC's is good when you control both ends of the
conversation. E-mail, however, is rife with inconsistent implementations
of the relevant RFC's, and as such, you will lose valid messages if your
MTA (or any of its components) requires strict adherence.

I agree with the notion that sqlgrey should restrict itself to solving
one fundamental problem, and solving it well. Let something else worry
about the validity of senders and recipients.

--Kyle
Philippe Chaintreuil
2010-01-21 13:43:28 UTC
Permalink
SQLgrey would have to guess the MySQL version in order to handle this
properly, to make it worthwhile it should check that the database
charset is in a whitelist (or a specific charset).
This would add quite a bit of code (probably buggy as conditions would
be hard to guess), I'd prefer putting configuration recommendations in
the HOWTO but I'm not exactly sure what would work on which version of
MySQL either...
The "MySQL 3.23/4.0/4.1 Manual" page looks pretty much the same as the
5.1 page I referenced yesterday:

http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html

SET NAMES 'latin1' looks like it should work on the older versions.


There's also the fact that MySQL <5.0 is end-of-life and not supported;
the documentation for it will disappear at the end of this year (2010).


-- Philippe Chaintreuil

Loading...