Hello!
First of all sorry for my poor english... I write from Italy...
I'm using full-text search with SQL Express 2005 and I can't undestand a
lot...
This is the situation:
- I'm using Italian word breaker;
- I've removed all words from my noise word list (noiseITA.txt)
- I've these entries in my thesaurus file (tsITA.xml):
<expansion>
<sub>REGGIO NELL'EMILIA</sub>
<sub>REGGIO EMILIA</sub>
<sub>REGGIO DI EMILIA</sub>
</expansion>
- I have these field's value in my records:
id location
-- --
1REGGIO EMILIA
2REGGIO-EMILIA
3REGGIO NELL'EMILIA
4REGGIO DI EMILIA
Case 1 (without thesaurus)
SELECT id FROM table1 WHERE CONTAINS(location , '"REGGIO-EMILIA"')
Result set: 2
- Why not "REGGIO EMILIA" too?
Case 2 (without thesaurus)
SELECT id FROM table1 WHERE CONTAINS(location , '"REGGIO EMILIA"')
Result set: 1, 3
- Why "REGGIO NELL'EMILIA" too?
Case 3 (without thesaurus)
SELECT id FROM table1 WHERE CONTAINS(location , '"REGGIO NELL''EMILIA"')
Result set: 1, 3
- Why "REGGIO EMILIA" too?
Case 4 (without thesaurus)
SELECT id FROM table1 WHERE CONTAINS(location , '"REGGIO DI EMILIA"')
Result set: 4
Case 5 (with thesaurus)
SELECT id FROM table1 WHERE CONTAINS(location , 'FORMSOF
(THESAURUS,"REGGIO-EMILIA")',LANGUAGE 'Italian')
Result set: 2
- Why only "REGGIO-EMILIA" ?
Case 6 (with thesaurus)
SELECT id FROM table1 WHERE CONTAINS(location , 'FORMSOF (THESAURUS,"REGGIO
EMILIA")',LANGUAGE 'Italian')
SELECT id FROM table1 WHERE CONTAINS(location , 'FORMSOF (THESAURUS,"REGGIO
NELL''EMILIA")',LANGUAGE 'Italian')
SELECT id FROM table1 WHERE CONTAINS(location , 'FORMSOF (THESAURUS,"REGGIO
DI EMILIA")',LANGUAGE 'Italian')
Result set: 1, 3, 4
- Why not "REGGIO-EMILIA" too?
I can't understand... is the hyphen (-) a word breaker sign?
What shall I do to obtain all the four records when I search for each of the
locations above ?
It doesn't work fine if I add "REGGIO-EMILIA" in the thesaurus file... :-(
Thank you very much!!!!
Paola
YOUR PROBLEM IS THAT YOUR CONTENT IS ALL UPPERCASE. THE ITALIAN WORD
BREAKER WILL BREAK REGGIO-EMILIA AS ONE WORD. HOWEVER IT WILL BREAK
Reggio-Emilia AS TWO WORDS. HAVE A LOOK AT THIS:
create database italian1
go
use italian1
GO
create fulltext catalog italian as default
go
create table italian (id int not null identity constraint italianPK primary
key, location varchar(30))
GO
create fulltext index on italian(location language italian) key index
italianpk
GO
insert into italian (location) values('REGGIO EMILIA')
insert into italian (location) values('REGGIO-EMILIA')
insert into italian (location) values('REGGIOEMILIA')
insert into italian (location) values('REGGIO NELL''EMILIA')
insert into italian (location) values('REGGIO DI EMILIA')
insert into italian (location) values('Reggio Emilia')
insert into italian (location) values('Reggio-Emilia')
insert into italian (location) values('Reggioemilia')
insert into italian (location) values('Reggio Nell''Emilia')
insert into italian (location) values('Reggio nell''Emilia')
insert into italian (location) values('Reggio Di Emilia')
insert into italian (location) values('Reggio di Emilia')
GO
SELECT id FROM italian WHERE CONTAINS(location , '"REGGIO-EMILIA"')
Result set: 2
-- Why not "REGGIO EMILIA" too?
--all uppercase will consider the word as one token
SELECT id FROM italian WHERE CONTAINS(location , '"Reggio-Emilia"')
--returns, 1, 3, 5, 6, 7, 8, 9, & 10
Case 2 (without thesaurus)
SELECT id,location FROM italian WHERE CONTAINS(location , '"REGGIO EMILIA"')
Result set: 1, 3
-- Why "REGGIO NELL'EMILIA" too? --nell is a noise word and as such is
discarded in the search
SELECT id FROM italian WHERE CONTAINS(location , '"Reggio Emilia"')
--returns, 1, 3, 6, 7, & 10
Case 3 (without thesaurus)
SELECT id,location FROM italian WHERE CONTAINS(location , '"REGGIO
NELL''EMILIA"')
Result set: 1, 3
-- Why "REGGIO EMILIA" too? nella is a noise word and thrown away
Case 4 (without thesaurus)
SELECT id,location FROM italian WHERE CONTAINS(location , '"REGGIO DI
EMILIA"')
Result set: 4
Case 5 (with thesaurus)
SELECT id FROM italian WHERE CONTAINS(location , 'FORMSOF
(THESAURUS,"REGGIO-EMILIA")',LANGUAGE 'Italian')
Result set: 2
-- Why only "REGGIO-EMILIA" ? Cause its uppper case, try this
SELECT id FROM italian WHERE CONTAINS(location , 'FORMSOF
(THESAURUS,"Reggio-Emilia")',LANGUAGE 'Italian')
--1, 3, 5, 6, 7, 8, 9, & 10 are returned
Case 6 (with thesaurus)
SELECT id FROM italian WHERE CONTAINS(location , 'FORMSOF (THESAURUS,"REGGIO
EMILIA")',LANGUAGE 'Italian')
SELECT id FROM italian WHERE CONTAINS(location , 'FORMSOF (THESAURUS,"REGGIO
NELL''EMILIA")',LANGUAGE 'Italian')
-- I can't figure out case 6 however
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.
This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"paola" <paola@.discussions.microsoft.com> wrote in message
news:FF634D5E-B3B9-412A-89AC-CA4C10F1A364@.microsoft.com...
> Hello!
> First of all sorry for my poor english... I write from Italy...
> I'm using full-text search with SQL Express 2005 and I can't undestand a
> lot...
> This is the situation:
> - I'm using Italian word breaker;
> - I've removed all words from my noise word list (noiseITA.txt)
> - I've these entries in my thesaurus file (tsITA.xml):
> <expansion>
> <sub>REGGIO NELL'EMILIA</sub>
> <sub>REGGIO EMILIA</sub>
> <sub>REGGIO DI EMILIA</sub>
> </expansion>
>
> - I have these field's value in my records:
> id location
> -- --
> 1 REGGIO EMILIA
> 2 REGGIO-EMILIA
> 3 REGGIO NELL'EMILIA
> 4 REGGIO DI EMILIA
>
> Case 1 (without thesaurus)
> SELECT id FROM table1 WHERE CONTAINS(location , '"REGGIO-EMILIA"')
> Result set: 2
> - Why not "REGGIO EMILIA" too?
> Case 2 (without thesaurus)
> SELECT id FROM table1 WHERE CONTAINS(location , '"REGGIO EMILIA"')
> Result set: 1, 3
> - Why "REGGIO NELL'EMILIA" too?
> Case 3 (without thesaurus)
> SELECT id FROM table1 WHERE CONTAINS(location , '"REGGIO NELL''EMILIA"')
> Result set: 1, 3
> - Why "REGGIO EMILIA" too?
> Case 4 (without thesaurus)
> SELECT id FROM table1 WHERE CONTAINS(location , '"REGGIO DI EMILIA"')
> Result set: 4
> Case 5 (with thesaurus)
> SELECT id FROM table1 WHERE CONTAINS(location , 'FORMSOF
> (THESAURUS,"REGGIO-EMILIA")',LANGUAGE 'Italian')
> Result set: 2
> - Why only "REGGIO-EMILIA" ?
> Case 6 (with thesaurus)
> SELECT id FROM table1 WHERE CONTAINS(location , 'FORMSOF
> (THESAURUS,"REGGIO
> EMILIA")',LANGUAGE 'Italian')
> SELECT id FROM table1 WHERE CONTAINS(location , 'FORMSOF
> (THESAURUS,"REGGIO
> NELL''EMILIA")',LANGUAGE 'Italian')
> SELECT id FROM table1 WHERE CONTAINS(location , 'FORMSOF
> (THESAURUS,"REGGIO
> DI EMILIA")',LANGUAGE 'Italian')
> Result set: 1, 3, 4
> - Why not "REGGIO-EMILIA" too?
>
> I can't understand... is the hyphen (-) a word breaker sign?
> What shall I do to obtain all the four records when I search for each of
> the
> locations above ?
> It doesn't work fine if I add "REGGIO-EMILIA" in the thesaurus file...
> :-(
> Thank you very much!!!!
> Paola
>
>
|||Ok, thanks a lot Hilary...
I'm doing some tests to undestand... and so... why the query:
SELECT * FROM italian WHERE CONTAINS(location ,
'"Reggio Emilia"',LANGUAGE 'Italian')
doesn't return "Reggio di Emilia" ?
And :
SELECT * FROM italian WHERE CONTAINS(location ,
'"Reggio di Emilia"',LANGUAGE 'Italian')
returns only:
REGGIO DI EMILIA
Reggio Di Emilia
Reggio di Emilia
"di" is a noise word, isn't it?...
And so I suppose that "di" is ignored in search string and in field value
too. Why not?
Thanks a lot...
Paola
"Hilary Cotter" wrote:
> YOUR PROBLEM IS THAT YOUR CONTENT IS ALL UPPERCASE. THE ITALIAN WORD
> BREAKER WILL BREAK REGGIO-EMILIA AS ONE WORD. HOWEVER IT WILL BREAK
> Reggio-Emilia AS TWO WORDS. HAVE A LOOK AT THIS:
> create database italian1
> go
> use italian1
> GO
> create fulltext catalog italian as default
> go
> create table italian (id int not null identity constraint italianPK primary
> key, location varchar(30))
> GO
> create fulltext index on italian(location language italian) key index
> italianpk
> GO
> insert into italian (location) values('REGGIO EMILIA')
> insert into italian (location) values('REGGIO-EMILIA')
> insert into italian (location) values('REGGIOEMILIA')
> insert into italian (location) values('REGGIO NELL''EMILIA')
> insert into italian (location) values('REGGIO DI EMILIA')
> insert into italian (location) values('Reggio Emilia')
> insert into italian (location) values('Reggio-Emilia')
> insert into italian (location) values('Reggioemilia')
> insert into italian (location) values('Reggio Nell''Emilia')
> insert into italian (location) values('Reggio nell''Emilia')
> insert into italian (location) values('Reggio Di Emilia')
> insert into italian (location) values('Reggio di Emilia')
> GO
> SELECT id FROM italian WHERE CONTAINS(location , '"REGGIO-EMILIA"')
> Result set: 2
> -- Why not "REGGIO EMILIA" too?
> --all uppercase will consider the word as one token
> SELECT id FROM italian WHERE CONTAINS(location , '"Reggio-Emilia"')
> --returns, 1, 3, 5, 6, 7, 8, 9, & 10
> Case 2 (without thesaurus)
> SELECT id,location FROM italian WHERE CONTAINS(location , '"REGGIO EMILIA"')
> Result set: 1, 3
> -- Why "REGGIO NELL'EMILIA" too? --nell is a noise word and as such is
> discarded in the search
> SELECT id FROM italian WHERE CONTAINS(location , '"Reggio Emilia"')
> --returns, 1, 3, 6, 7, & 10
> Case 3 (without thesaurus)
> SELECT id,location FROM italian WHERE CONTAINS(location , '"REGGIO
> NELL''EMILIA"')
> Result set: 1, 3
> -- Why "REGGIO EMILIA" too? nella is a noise word and thrown away
> Case 4 (without thesaurus)
> SELECT id,location FROM italian WHERE CONTAINS(location , '"REGGIO DI
> EMILIA"')
> Result set: 4
> Case 5 (with thesaurus)
> SELECT id FROM italian WHERE CONTAINS(location , 'FORMSOF
> (THESAURUS,"REGGIO-EMILIA")',LANGUAGE 'Italian')
> Result set: 2
> -- Why only "REGGIO-EMILIA" ? Cause its uppper case, try this
> SELECT id FROM italian WHERE CONTAINS(location , 'FORMSOF
> (THESAURUS,"Reggio-Emilia")',LANGUAGE 'Italian')
> --1, 3, 5, 6, 7, 8, 9, & 10 are returned
> Case 6 (with thesaurus)
> SELECT id FROM italian WHERE CONTAINS(location , 'FORMSOF (THESAURUS,"REGGIO
> EMILIA")',LANGUAGE 'Italian')
> SELECT id FROM italian WHERE CONTAINS(location , 'FORMSOF (THESAURUS,"REGGIO
> NELL''EMILIA")',LANGUAGE 'Italian')
> -- I can't figure out case 6 however
> --
> Hilary Cotter
> Director of Text Mining and Database Strategy
> RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.
> This posting is my own and doesn't necessarily represent RelevantNoise's
> positions, strategies or opinions.
> Looking for a SQL Server replication book?
> http://www.nwsu.com/0974973602.html
> Looking for a FAQ on Indexing Services/SQL FTS
> http://www.indexserverfaq.com
>
> "paola" <paola@.discussions.microsoft.com> wrote in message
> news:FF634D5E-B3B9-412A-89AC-CA4C10F1A364@.microsoft.com...
>
>
|||Because it knows there is something in-between. Try this
insert into italian (location) values('Reggio XX Emilia')
and now this
SELECT * FROM italian WHERE CONTAINS(location ,
'"Reggio di Emilia"',LANGUAGE 'Italian')
you will see the hit to
Reggio di Emilia
and
Reggio XX Emilia
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.
This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"paola" <paola@.discussions.microsoft.com> wrote in message
news:7319F1AA-1CFC-42BF-9A70-F81C38E70840@.microsoft.com...[vbcol=seagreen]
> Ok, thanks a lot Hilary...
> I'm doing some tests to undestand... and so... why the query:
> SELECT * FROM italian WHERE CONTAINS(location ,
> '"Reggio Emilia"',LANGUAGE 'Italian')
> doesn't return "Reggio di Emilia" ?
> And :
> SELECT * FROM italian WHERE CONTAINS(location ,
> '"Reggio di Emilia"',LANGUAGE 'Italian')
> returns only:
> REGGIO DI EMILIA
> Reggio Di Emilia
> Reggio di Emilia
> "di" is a noise word, isn't it?...
> And so I suppose that "di" is ignored in search string and in field value
> too. Why not?
> Thanks a lot...
> Paola
>
>
>
> "Hilary Cotter" wrote:
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment